Chairman: D. G. Childers

Major  Department:  Electrical  Engineering 

The research investigated the effects of vocal excitation
characteristics on the production of voice quality. The knowledge
gained was used to derive design characteristics for improving
voice excitations for speech synthesis. Several voice types were
investigated: modal, vocal fry, falsetto, and breathy voices. The
research project included source-feature extraction, source modeling
for speech synthesis, and perceptual evaluation.

Three categories of analysis techniques were developed to extract
source-related features from speech and electroglottographic (EGG)
signals. These included: (1) a new inverse filtering method for
estimating the glottal waves from speech signals, (2) methods for
measuring the source features (the spectral slope, the amount of
turbulent noise, and the temporal energy distribution) directly from
speech signals, and (3) EGG waveform analysis methods for predicting
the vocal fold contact phenomena. The analysis results showed a
great diversity in temporal and spectral characteristics for glottal
excitations of different voice types.

Based on the analysis results, four major factors were found to
be  important  in  characterizing  the  glottal  excitations  for  different 
voice  types:  the  glottal  pulse  width,  the  glottal  pulse  skewness,  the 
abruptness  of  glottal  closure,  and  the  turbulent  noise  component. 
The  significance  of  these  factors  for  voice  synthesis  was  studied. 
Existing  voice  source  models  were  evaluated  based  on  their  capability 
to  control  these  factors.  Then,  an  improved  voice  source  model  was 
proposed  and  evaluated. 

Using  the  new  source  model  with  a cascade  formant  synthesizer, 
synthetic  voice  samples  were  produced  by  systematically  varying  source 
parameters  that  correspond  to  selected  glottal  factors.  Listening 
tests  were  conducted  to  evaluate  the  auditory  effects.  The  results 
showed  that  the  sensation  of  vocal  effort  is  closely  related  to  the 
glottal  pulse  skewness.  A high  pulse  skewness  results  in  tense  voice 
quality,  while  a low  pulse  skewness  results  in  lax  voice  quality.  The 
perceptual  evaluation  also  revealed  that  the  degree  of  breathiness  is 
strongly  correlated  with  the  noise-to-harmonic  ratio  over  the  frequency 
interval above 2 kHz.

In  summary,  the  major  contributions  of  this  dissertation  were 
the  development  of  new  speech  analysis  and  synthesis  techniques  for 
designing  new  systems  for  producing  natural-sounding  synthetic  speech 
with  a desired  voice  quality. 


CHAPTER 1


INTRODUCTION 


Voice  Quality  and  Phonation  Types 

Voice quality usually refers to the total auditory
impression a listener experiences upon hearing the speech of another
talker. There is no generally accepted definition of voice quality,
and the term quality itself has been used in different contexts.
A phonetician  might  use  quality  in  the  context  of  articulatory 
differences;  for  example,  the  vowels  in  "hot"  and  "cot"  differ  in  their 
"vowel  quality."  But  a singer  might  use  quality  in  terms  of  vocal 
registers  which  are  related  to  the  laryngeal  vibration.  Further, 
quality  might  be  defined  in  such  terms  as  breathy,  hoarse,  harsh,  etc. 
In  this  study,  we  investigate  the  voice  quality  variations  caused  by 
the  changes  of  vocal  fold  vibratory  patterns.  Our  major  concerns  here 
are  the  vocal  excitation  characteristics  and  the  perceptual  correlates 
of  different  types  of  phonation. 

According  to  Laver  and  Hanson  [1981],  there  are  six  major  types 
of  phonation,  namely,  modal  voice,  vocal  fry,  falsetto,  breathiness, 
harshness,  and  whisper.  In  the  literature  on  the  human  voice,  modal 
voice, vocal fry, and falsetto are often termed three different
"vocal  registers,"  which  are  related  to  the  production  of  voice  pitch 
and  the  pitch  range  of  an  individual.  Although  other  terms  have  been 
used  to  describe  register  divisions  (e.g.,  Morner  et  al.  [1963]  listed 
107  labels),  researchers  [e.g.,  Boone,  1971;  Hollien,  1974;  van  Deinse, 
1981] generally agree that a particular vocal register is characterized
by  a certain  pattern  of  vocal  fold  vibration,  with  the  vocal  folds 
approximated  in  a similar  way  throughout  a particular  pitch  range.  And 
once  this  pitch  range  reaches  its  maximum  limit,  the  vocal  folds  adjust 
to  a new  approximation  contour,  which  produces  an  abrupt  change  in 
voice  quality. 

Breathiness and harshness are considered voice disorders, which
may arise from various functional or organic pathologies. These voice
problems are usually related to faulty muscular tension in various
sites of the vocal mechanism, and have been characterized as
hyperfunctional voice disorders [Boone, 1971]. At the level of the larynx,
hyperfunctional  voice  disorders  are  related  to  difficulties  in  optimal 
approximation  of  the  vocal  folds  during  phonation.  Laxity  of  vocal 
fold  approximation  generally  produces  an  escape  of  air,  perceived  as 
breathiness.  On  the  other  hand,  over-adduction  of  the  vocal  folds 
during  phonation  causes  harsh  voice. 

In  the  present  study,  the  term  "voice  types"  will  be  used  to  refer 
to  the  three  vocal  registers  (modal,  falsetto,  and  vocal  fry)  as  well 
as  to  breathy  and  harsh  voices. 

Research  Issues 

Speech researchers have long been interested in the phonation
problem in human voice production, aiming to develop improved
speech  analysis  and  synthesis  techniques  [e.g.,  Flanagan,  1958,  1972; 
Fant,  1960,  1979,  1980].  To  fully  understand  the  phonation  problem, 
three  aspects,  namely,  the  physiological,  the  acoustic,  and  the 
perceptual  aspects,  of  the  problem  should  be  addressed.  For  the 
physiological  aspect,  we  need  to  understand  the  activities  of  the 
laryngeal muscles and the vocal fold vibration patterns during
phonation.  For  the  acoustic  aspect,  we  need  to  extract  the  acoustic 
correlates  that  characterize  different  types  of  phonation.  For  the 
perceptual  aspect,  we  need  to  determine  those  factors  the  human 
auditory  system  uses  to  discriminate  different  voice  qualities.  We 
also  need  to  understand  the  interrelationships  between  these  three 
aspects. The present research represents a step toward the ultimate
objective of understanding the various aspects of phonation.

The  purpose  of  this  research  was  to  investigate  vocal  excitation 
characteristics and to design voice source models for natural-sounding
speech  synthesis  with  any  desired  voice  quality.  We  investigated  the 
nature  and  extent  of  the  glottal  excitation  variations  by  analyzing 
speech  and  electroglottographic  (EGG)  signals  of  various  voice  types 
(modal,  vocal  fry,  falsetto  and  breathy  voices).  A new  source  model 
for  speech  synthesis  was  developed,  which  contains  factors  that  are 
important  in  characterizing  various  voice  qualities.  The  auditory 
effects  of  selected  glottal  features  were  studied  by  systematically 
varying  appropriate  source  parameters  in  speech  synthesis  experiments. 
We  believe  that  the  knowledge  gained  in  this  research  about  vocal 
excitation  characteristics  and  voice  quality  variations  will  provide 
useful information for future development of natural-sounding speech
synthesizers  and  reliable  speaker-independent  speech  recognition 
systems. 

Figure  1-1  summarizes  the  three  aspects  of  the  phonation  problem, 
the  basic  research  strategy,  and  the  specific  aims  of  the  present 
research.  A brief  introduction  of  phonatory  biomechanics  is  included 
in  the  next  section  to  provide  a physiological  and  aerodynamic 
background for this report. This is followed by a review of previous
research on the acoustic and perceptual correlates of various voice
types.

Figure 1-1. The research problem, strategy and objectives.


Phonatory  Biomechanics 

The  most  widely  accepted  theory  of  phonation  is  the  myoelastic- 
aerodynamic  theory,  presented  by  van  den  Berg  [1958].  According  to 
this  theory,  the  vocal  fold  vibrations  are  produced  by  an  interaction 
between  muscular  and  elastic  forces  within  the  vocal  folds  and  the 
aerodynamic  forces  that  impinge  on  the  vocal  folds.  Air  pressure 
from  the  lungs  forces  the  vocal  folds  open,  initiating  a phonatory 
vibration.  At  a maximum  open  position  in  the  vibratory  cycle,  the  air 
stream  passing  through  the  glottis  reaches  a velocity  that  causes  an 
intraglottal  pressure  drop  due  to  the  so-called  Bernoulli  effect.  The 
vocal  folds  are  forced  to  close,  beginning  at  the  interior  part  of  the 
glottis.  This  is  due  to  the  Bernoulli  effect  as  well  as  the  inherent 
elasticity  of  vocal  folds.  Following  glottal  closure  the  air  stream  is 
momentarily  interrupted  and  the  subglottal  pressure  builds  up  against 
the  closed  vocal  folds.  When  the  subglottal  pressure  is  sufficiently 
great  to  overcome  the  closing  force,  the  vocal  folds  begin  to  part. 
This  separation  begins  at  the  interior  part  of  the  glottis.  The  cycle 
starts  over  again.  The  word  "aerodynamic"  implies  that  the  vocal  folds 
are  activated  by  the  air  stream  from  the  lungs  rather  than  by  nerve 
impulses.  "Myoelastic"  refers  to  the  way  in  which  the  muscles  (myo-) 
change  their  elasticity  and  tension  to  effect  changes  in  the  frequency 
of  vibration. 

Researchers have studied the vibratory movements of the vocal folds
during phonation by using indirect observation methods, such as X-ray
laminagraphy [e.g., Allen and Hollien, 1973], laryngeal stroboscopy
[e.g., Hirano, 1981], and ultrahigh-speed laryngeal cinematography
[e.g., Childers et al., 1983], and indirect signal measurement devices,
such as electroglottography (EGG) [e.g., Childers and Krishnamurthy,
1985], photoglottography (PGG) [e.g., Kitzing, 1982], and ultrasound
glottography (UGG) [e.g., Hamlet, 1981]. Different types of phonation
were found to be characterized by distinctive vocal fold vibratory
patterns.

In  modal  register,  the  vocal  folds  appear  thick  and  rounded  in  the 
frontal  projection.  By  means  of  lateral  X-ray  techniques,  Damste  et 
al.  [1968]  observed  that  the  length  of  the  vocal  folds  systematically 
increases  as  the  fundamental  frequency  of  phonation  is  increased. 
By using X-ray laminagraphy, Hollien and Colton [1969] observed that
the  thickness  of  the  vocal  folds  systematically  decreases  as  the 
fundamental  frequency  of  phonation  is  increased.  The  vocal  fold 
vibratory  pattern  of  a modal  phonation  is  characterized  by  a moderate 
frequency,  wide  lateral  excursions,  and  complete  closure  of  the  glottis 
during  about  one-third  of  the  entire  pitch  period  [Hollien,  1974]. 

In  vocal  fry  register,  the  arytenoid  cartilages  are  tightly 
pressed  together,  so  that  the  vocal  folds  can  vibrate  only  at  the 
anterior  end  [Ladefoged,  1975].  The  vocal  fold  vibratory  pattern  is 
characterized  by  a low  frequency,  small  lateral  excursions,  and  a long 
closed  glottal  phase  [Hollien,  1974].  Using  high-speed  cinematography, 
Moore  and  von  Leden  [1958]  and  Timcke  et  al.  [1959]  observed  a double 
vibration  of  the  vocal  folds  followed  by  a prolonged  period  of  closed 
phase  during  vocal  fry.  Also  by  using  high-speed  cinematography, 
Whitehead et al. [1984] reported that single, double, and triple
opening/closing patterns are all possible for a vocal fry phonation.
Allen  and  Hollien  [1973]  observed  that  the  ventricular  space  in  vocal 
fry register is smaller than in modal register, indicating ventricular
fold  impingement. 

In  falsetto  register,  the  vocal  folds  are  stretched  in  the 
longitudinal direction, and the vertical cross-section of the edges
of the vocal folds is extremely thin [Hollien et al., 1968]. The
posterior  cartilaginous  portion  is  joined  tightly  so  that  there  is 
little  or  no  posterior  vibration.  The  vibratory  excursion  in  the 
lateral  direction  is  also  restricted  by  the  stretched  folds.  The  vocal 
fold vibratory pattern of falsetto register is characterized by a
high frequency, small lateral excursions, and only momentary or
even absent contact between the vocal folds [Hollien, 1974]. Judson
and Weaver [1965] have described the falsetto as an effect produced
by strong contraction of the internal fibers of the thyroarytenoid
muscles, with the external fibers relatively relaxed.

Breathy  voice  is  characterized  by  an  incomplete  and  lax  vocal 
fold  approximation  during  phonation  [Boone,  1971].  Because  of  the  low 
muscular  effort,  the  lessened  glottal  resistance  leads  to  a higher 
rate  of  airflow  than  in  modal  voice.  Breathy  voice  may  be  produced 
with  the  glottis  fairly  open  at  the  posterior  end,  or  it  may  be 
produced  with  a narrow  opening  extending  over  nearly  the  whole  length 
of  the  vocal  folds  [Ladefoged,  1975].  During  breathy  phonation,  the 
vocal  folds  do  not  actually  come  into  complete  contact,  but  instead 
simply  "flap"  in  the  airstream. 

Harsh  voice  is  characterized  by  aperiodic  vocal  fold  vibration 
[Coleman, 1960; Moore, 1962; Wendahl, 1963, 1966]. Researchers
[Kaplan, 1960; Zemlin, 1964; Boone, 1971; Laver, 1980] generally agree
that  excessive  laryngeal  tension  is  the  physiological  correlate  of 
harshness.  Kaplan  [1960]  reported  that  the  vocal  folds  are  drawn 
too tightly together during the phonation of a harsh voice. Van den
Berg  [1955]  reported  that  when  harshness  becomes  very  severe,  the 
ventricular  folds  become  involved  in  phonation  by  pressing  down  on  the 
upper  surface  of  the  true  vocal  folds. 

Previous  Research 

Historically,  many  researchers  have  classified  vocal  registers 
along  the  frequency  continuum.  For  example,  Hollien  and  Michel  [1968] 
reported  the  phonational  ranges  of  12  males  and  11  females.  For  male 
speakers, vocal fry had a mean low-frequency limit of 24 Hz and
a mean high-frequency limit of 52 Hz. Modal register had a mean
range  of  94  to  287  Hz,  and  falsetto  had  a mean  range  of  275  to  634  Hz. 
The  mean  frequency  range  for  females  in  vocal  fry  was  18  to  46  Hz, 
in  the  modal  register  144  to  538  Hz,  and  in  falsetto  495  to  1131  Hz. 
Thus,  these  three  vocal  registers  appear  to  occupy  a specific  range 
of  frequencies,  although  the  exact  frequency  limits  vary  within  and 
across  gender.  Moreover,  among  Hollien  and  Michel's  subjects,  5 out 
of  12  males  and  7 out  of  11  females  displayed  an  overlap  between  the 
frequency  ranges  of  the  modal  and  falsetto  registers.  McGlone  and 
Brown  [1969]  also  reported  data  in  which  some  subjects  displayed  an 
overlap  between  the  two  registers.  These  results  suggested  that  vocal 
registers cannot be classified solely by their positions on the
frequency  continuum.  Other  parameters,  physiological,  perceptual,  or 
acoustic,  seem  to  be  required  for  differentiating  vocal  registers. 
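
The overlap argument can be checked with a few lines of arithmetic. The following Python sketch uses only the mean frequency limits quoted above from Hollien and Michel [1968]. The dictionary layout and the function names are illustrative assumptions and are not part of the original study.

```python
# Mean phonational ranges in Hz, as quoted above from Hollien and Michel [1968].
# The data layout and the function names are illustrative assumptions.
MEAN_RANGES_HZ = {
    "male": {"vocal fry": (24, 52), "modal": (94, 287), "falsetto": (275, 634)},
    "female": {"vocal fry": (18, 46), "modal": (144, 538), "falsetto": (495, 1131)},
}


def overlap_hz(range_a, range_b):
    """Return the width in Hz of the overlap between two (low, high) ranges."""
    low = max(range_a[0], range_b[0])
    high = min(range_a[1], range_b[1])
    return max(0, high - low)


def candidate_registers(f0_hz, gender):
    """List every register whose mean range contains the given F0."""
    return [
        name
        for name, (low, high) in MEAN_RANGES_HZ[gender].items()
        if low <= f0_hz <= high
    ]


if __name__ == "__main__":
    for gender, ranges in MEAN_RANGES_HZ.items():
        width = overlap_hz(ranges["modal"], ranges["falsetto"])
        print(f"{gender}: modal and falsetto mean ranges overlap by {width} Hz")
    # An F0 inside the overlap region matches more than one register,
    # so frequency alone cannot decide the register.
    print(candidate_registers(280, "male"))
```

Even the group means overlap for both genders, which is consistent with the conclusion that additional physiological, perceptual, or acoustic parameters are needed to separate the registers.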

Wendahl  et  al.  [1963]  suggested  that  the  primary  criterion  that 
must  be  met  for  the  perception  of  vocal  fry  is  that  the  vocal  tract  be 
highly  damped  between  glottal  excitations.  Coleman  [1963]  established 
that  vocal  fry  is  perceived  when  the  sound  wave  decays  by  approximately 
43 dB between excitation pulses. Michel and Hollien [1968] studied the
perceptual  effects  of  vocal  fry  and  harsh  voice  and  concluded  that, 
perceptually,  clinical  harshness  and  vocal  fry  are  distinct  phonatory 
entities.  By  means  of  ultrahigh-speed  laryngeal  filming,  Hollien  et 
al.  [1977]  observed  that  vocal  fry  exhibits  a short  open  time  (less 
than  25%  of  the  pitch  period)  and  a long  closed  time  during  each  pitch 
period.  Also  by  means  of  ultrahigh-speed  laryngeal  filming,  Whitehead 
et  al.  [1984]  reported  that  during  vocal  fry  phonation,  the  vibratory 
pattern  of  the  vocal  folds  may  be  characterized  by  a single  opening/ 
closing  as  well  as  multiple  openings/closings  during  an  individual 
cycle. 
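
The damping criterion can be illustrated numerically. The sketch below is not taken from the cited studies: it assumes, purely for illustration, that the response between excitations behaves like a single resonance whose amplitude envelope decays as exp(-pi * B * t), where B is the resonance bandwidth in Hz. The 60 Hz bandwidth and the two F0 values are hypothetical.

```python
import math


def decay_db(amplitude_start, amplitude_end):
    """Decay in dB between two positive envelope amplitudes."""
    return 20.0 * math.log10(amplitude_start / amplitude_end)


def resonance_decay_db(bandwidth_hz, f0_hz):
    """Decay over one pitch period, assuming an envelope exp(-pi * B * t)."""
    period_s = 1.0 / f0_hz
    envelope_at_period_end = math.exp(-math.pi * bandwidth_hz * period_s)
    return decay_db(1.0, envelope_at_period_end)


if __name__ == "__main__":
    # Hypothetical bandwidth of 60 Hz at a modal-like and a fry-like F0.
    for f0_hz in (120.0, 40.0):
        print(f0_hz, "Hz:", round(resonance_decay_db(60.0, f0_hz), 1), "dB per period")
```

Under this assumption the decay per period grows as the period lengthens, so the long periods of vocal fry leave more time for the sound wave to die away between excitation pulses.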

Colton  [1969,  1972,  1973b]  and  Colton  and  Hollien  [1972,  1973] 
reported  a series  of  experiments  in  which  they  studied  the  perceptual 
and  acoustic  correlates  of  modal  and  falsetto  registers.  Their  results 
showed  that  listeners  can  correctly  place  phonations  in  modal  and 
falsetto  registers  into  the  proper  categories  solely  on  the  basis  of 
voice quality. They also found that: 1) the fundamental-frequency
component is much more intense in falsetto register than in modal
register; and 2) the number of harmonic partials in a falsetto spectrum
is much smaller than that in a modal spectrum. (A harmonic partial was
considered  measurable  if  it  possessed  energy  at  least  2 dB  above  the 
noise  level,  according  to  Colton's  [1969]  definition.)  Using  combined 
photo-  and  electroglottographic  recordings,  Kitzing  [1982]  studied  the 
vocal  fold  vibratory  pattern  of  various  vocal  registers.  His  results 
showed  that  the  glottograms  of  falsetto  register  were  characterized  by 
increased  open  time  and  decreased  opening  to  closing  time  ratio  (known 
as  speed  quotient)  during  each  period. 
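
The measures mentioned in this paragraph are simple ratios and counts. The sketch below computes the speed quotient as defined above (opening time divided by closing time), the fraction of the period during which the glottis is open, and the number of measurable harmonic partials under Colton's 2 dB criterion. The function names and all numerical inputs are hypothetical and serve only to show the arithmetic.

```python
def speed_quotient(opening_time, closing_time):
    """Ratio of glottal opening time to closing time (same time units)."""
    return opening_time / closing_time


def open_fraction(opening_time, closing_time, period):
    """Fraction of the pitch period during which the glottis is open."""
    return (opening_time + closing_time) / period


def count_measurable_partials(harmonic_levels_db, noise_level_db, margin_db=2.0):
    """Count harmonics whose level is at least margin_db above the noise level."""
    threshold_db = noise_level_db + margin_db
    return sum(1 for level in harmonic_levels_db if level >= threshold_db)


if __name__ == "__main__":
    # Hypothetical timings in milliseconds for one glottal cycle.
    print("speed quotient:", speed_quotient(3.0, 1.5))
    print("open fraction:", open_fraction(3.0, 1.5, 9.0))
    # Hypothetical harmonic levels (dB) against a 10 dB noise level.
    print("partials:", count_measurable_partials([40.0, 30.0, 12.0, 11.9, 5.0], 10.0))
```
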

Monsen and Engebretson [1977] measured the glottal waves of
various  vocal  registers  using  the  reflectionless  tube  method  proposed 
by  Sondhi  [1975].  They  concluded  that:  1)  the  spectra  of  falsetto 
glottal waves are characterized by a steeply falling slope (about -20 dB
per  octave);  2)  for  modal  register,  the  glottal  wave  exhibits  a steep 
closing  branch,  while  in  falsetto  it  is  the  opening  portion  that  is 
more  abrupt;  3)  the  glottal  spectrum  of  vocal  fry  register  falls  off 
less  steeply  than  that  of  modal  register;  and  4)  the  glottal  waves  of 
vocal  fry  register  are  characterized  by  irregular  waveforms  and  high 
period-to-period  variations  in  fundamental  frequency. 
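
A spectral slope expressed in dB per octave can be estimated by fitting a straight line to harmonic levels plotted against the base-2 logarithm of frequency. The sketch below shows one such least-squares estimate. It is an illustration only and is not the measurement procedure of Monsen and Engebretson; the function name and the synthetic harmonic levels are assumptions.

```python
import math


def slope_db_per_octave(frequencies_hz, levels_db):
    """Least-squares slope of level (dB) against log2(frequency)."""
    octaves = [math.log2(f) for f in frequencies_hz]
    n = len(octaves)
    mean_x = sum(octaves) / n
    mean_y = sum(levels_db) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(octaves, levels_db))
    sxx = sum((x - mean_x) ** 2 for x in octaves)
    return sxy / sxx


if __name__ == "__main__":
    # Synthetic harmonics of a 300 Hz fundamental whose levels fall by
    # 20 dB for every doubling of frequency.
    harmonics_hz = [300.0 * k for k in range(1, 9)]
    levels_db = [-20.0 * math.log2(k) for k in range(1, 9)]
    print(round(slope_db_per_octave(harmonics_hz, levels_db), 2), "dB per octave")
```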

Coleman  [1960]  and  Moore  [1962]  reported  that  period-to-period 
fundamental  frequency  variations  are  associated  with  voices  judged  to 
be  harsh.  Wendahl  [1963,  1966]  employed  an  electrical  laryngeal  analog 
to  generate  stimuli  with  different  degrees  of  frequency  variations  on 
successive  periods.  The  results  of  his  listening  tests  showed  that: 
1)  the  degree  of  roughness  in  the  voice  was  judged  as  being  directly 
related  to  the  differences  in  period  intervals  for  successive  pitch 
periods;  and  2)  when  two  voices  with  equal  period-to-period  frequency 
variations  but  having  different  median  fundamental  frequencies  were 
given, the voice with lower fundamental frequency was judged to be
rougher.
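
The quantity in finding 1), the difference between successive pitch periods, is easy to compute from a sequence of fundamental frequency values. The sketch below uses hypothetical stimuli in which F0 alternates by the same amount around two different medians. It shows only the arithmetic: the same frequency variation corresponds to larger period differences at the lower median. It does not claim that this is why listeners judged the lower voice rougher.

```python
def mean_period_difference_ms(f0_sequence_hz):
    """Mean absolute difference between successive pitch periods, in ms.

    Requires at least two F0 values.
    """
    periods_ms = [1000.0 / f0 for f0 in f0_sequence_hz]
    differences = [abs(b - a) for a, b in zip(periods_ms, periods_ms[1:])]
    return sum(differences) / len(differences)


if __name__ == "__main__":
    # Hypothetical stimuli: F0 alternates by +/-5 Hz around two medians.
    low_voice = [95.0, 105.0, 95.0, 105.0]
    high_voice = [195.0, 205.0, 195.0, 205.0]
    print(round(mean_period_difference_ms(low_voice), 3), "ms")
    print(round(mean_period_difference_ms(high_voice), 3), "ms")
```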

Using  sound  spectrographic  analysis,  Yanagihara  [1967]  found  that 
breathy  and  hoarse  voices  are  characterized  by  a high-frequency  noise 
component.  Isshiki  et  al.  [1978]  studied  the  production  of  turbulent 
noise  by  using  a hard  life-size  model  of  the  larynx.  Their  data 
suggested  that  the  acoustic  energy  of  the  glottal  turbulent  noise  is 
distributed over a wide range of frequencies as in white noise, with some
accentuation at about 4 kHz. Hiraoka et al. [1984] performed spectral
analyses on speech signals and concluded that breathy voices are
characterized  by  high  interharmonic  noise  and  an  intense  fundamental 
component.  Hurme  and  Sonninen  [1985]  performed  long-term  spectral 
analysis  of  normal  and  disordered  voices  and  reported  that  breathy 
voice  has  a steeper  spectral  slope  than  a normal  voice. 

Table  1-1  summarizes  the  results  reported  in  the  previous  research 
about  the  physiological,  perceptual  and  acoustic  characteristics  of 
various  voice  types. 


Description  of  Chapters 

Chapter  2 presents  the  overall  research  scheme.  We  outline  the 
research  procedures  of  source-feature  extraction,  source  modeling,  and 
perceptual  evaluation.  The  data  collection  procedure  and  the  equipment 
used are also described. Chapter 3 describes the analysis techniques
for extracting source-related features from speech and EGG signals.
The analysis results for various voice types are presented. Chapter 4
deals with the problem of designing voice excitations for
natural-sounding speech synthesis. We discuss the factors that are
important for characterizing vocal excitations of different voice types.
Then, a new source model is introduced. Chapter 5 describes the procedures
for evaluating the perceptual correlates of the selected source factors
and discusses the results. Chapter 6, the last chapter, summarizes
the conclusions of this study and gives recommendations for future
research.

Table 1-1. Summary of the physiological, perceptual, and acoustic
characteristics of various voice types.


PHYSIOLOGICAL CHARACTERISTICS

Vocal Fold Length
    Modal voice:    medium, increases as F0 is increased [1, 2]
    Vocal fry:      short [2, 4]
    Falsetto:       long [2, 3]

Vocal Fold Thickness
    Modal voice:    medium, decreases as F0 is increased [2, 5, 6]
    Vocal fry:      thick [2, 5]
    Falsetto:       thin [2, 3, 6, 7]

Vibratory Pattern
    Modal voice:    longer open time, rapid closing phase [2, 11]
    Vocal fry:      sharp, short pulse, long closed time, one or multiple
                    opening/closing [2, 8, 9, 10, 11, 12]
    Falsetto:       gradual opening and closing slopes, short or no closed
                    phase [2, 11, 25]
    Breathy voice:  small vibratory excursion, incomplete glottal
                    closure [2, 3, 4]
    Harsh voice:    great cycle-to-cycle variation [13, 14, 15, 16]


PERCEPTUAL CHARACTERISTICS

Pitch
    Modal voice:    medium [2, 19]
    Vocal fry:      low [2, 17]
    Falsetto:       high [18, 19]
    Breathy voice:  wide range [3, 21]
    Harsh voice:    wide range [17, 21]

Loudness
    Modal voice:    wide range [2]
    Vocal fry:      softer [2]
    Falsetto:       softer [2]
    Breathy voice:  softer [3, 21]
    Harsh voice:    wide range [3, 21]

Quality
    Modal voice:    normal [2, 19]
    Vocal fry:      repetitive popping sound [2, 20, 21]
    Falsetto:       flute-like tone, sometimes slightly breathy [2, 19, 21]
    Breathy voice:  audible friction noise [3, 21]
    Harsh voice:    unpleasant, rough, rasping sound [3, 21]


ACOUSTIC CHARACTERISTICS

Fundamental Frequency
    Modal voice:    75-500 Hz [2, 18, 22]
    Vocal fry:      1-70 Hz [2, 22]
    Falsetto:       150-750 Hz [2, 18, 22]
    Breathy voice:  wide range [3]
    Harsh voice:    wide range [3, 17]

Vocal Intensity
    Modal voice:    wide range [2, 23]
    Vocal fry:      low [2, 3]
    Falsetto:       low to medium [2, 23]
    Breathy voice:  low [2, 3, 21]
    Harsh voice:    medium to high [2, 3, 21]

Spectral Slope
    Modal voice:    medium, F0 dependent [2, 24]
    Vocal fry:      relatively flat [2, 11]
    Falsetto:       steep [2, 11, 24]
    Breathy voice:  steep [26]
    Harsh voice:    relatively flat [26]

Turbulent Noise
    Modal voice:    low [27, 28, 29]
    Vocal fry:      -
    Falsetto:       -
    Breathy voice:  high [27, 28, 29]
    Harsh voice:    -

Pitch Perturbation
    Modal voice:    low [30-38]
    Vocal fry:      high [11]
    Falsetto:       -
    Breathy voice:  high [30-38]
    Harsh voice:    high [30-38]

Amplitude Perturbation
    Modal voice:    low [30-38]
    Vocal fry:      -
    Falsetto:       -
    Breathy voice:  high [30-38]
    Harsh voice:    high [30-38]

References:

 1 = Damste et al. [1968]
 2 = Hollien [1974]
 3 = Boone [1971]
 4 = Ladefoged [1975]
 5 = Allen and Hollien [1973]
 6 = Hollien and Colton [1969]
 7 = Hollien et al. [1968]
 8 = Moore and von Leden [1958]
 9 = Timcke et al. [1959]
10 = Whitehead et al. [1984]
11 = Monsen and Engebretson [1977]
12 = Hollien et al. [1977]
13 = Coleman [1960]
14 = Moore [1962]
15 = Wendahl [1963]
16 = Wendahl [1964]
17 = Michel and Hollien [1968]
18 = Colton [1969]
19 = Colton and Hollien [1973]
20 = Wendahl et al. [1963]
21 = Laver [1980]
22 = Hollien and Michel [1968]
23 = Colton [1973a]
24 = Colton [1973b]
25 = Kitzing [1982]
26 = Hurme and Sonninen [1985]
27 = Yanagihara [1967]
28 = Isshiki et al. [1978]
29 = Hiraoka et al. [1984]
30 = Deal and Emanuel [1978]
31 = Davis [1979]
32 = Horii [1980]
33 = Heiberger and Horii [1982]
34 = Hiller et al. [1983]
35 = Hirano [1985]
36 = Askenfelt and Hammarberg [1986]
37 = Kasuya et al. [1986]
38 = Wolfe and Steinfatt [1987]


CHAPTER  2 


RESEARCH DESIGN


Overview

For  speech  scientists  and  engineers,  one  of  the  most  important 
objectives  of  studying  human  speech  production  phenomena  is  to  derive 
appropriate  parameters  for  speech  analysis  and  synthesis  models. 
This is the case in the present research. The purpose of this research
was to investigate the effects of vocal excitation characteristics on the
production of voice quality. We studied the nature and extent of the
glottal excitation variations introduced by using different types of
phonation. The knowledge gained was used to develop a voice source
model for producing natural-sounding synthetic speech with a wide range
of voice characteristics.

The  block  diagram  of  the  research  scheme  is  shown  in  Figure  2-1. 
The  research  project  included  source-feature  extraction,  voice  source 
modeling,  and  perceptual  evaluation.  Speech  and  electroglottographic 
(EGG)  signals  of  various  voice  types  were  first  collected.  Analysis 
techniques  were  developed  to  extract  source-related  features  from 
the speech and EGG signals. The nature and extent of the extracted
source features were analyzed in the time and frequency domains. The
significance of these source features for the perception of naturalness
and voice quality was also studied. The knowledge gained was used
to  develop  a new  voice  source  model  that  contains  factors  important 
for characterizing voice quality. Using the new source model, speech
samples were synthesized by systematically varying the parameters that
correspond to selected source features. Listening tests were conducted
to evaluate the auditory effects. Rules for determining source
parameter values to synthesize specific voice qualities were also an
outcome of this research.

Figure 2-1. Block diagram of the basic research scheme.
[Diagram labels. Voice types: modal, falsetto, vocal fry, breathy, harsh.
Signals: speech, EGG. Blocks: glottal waveform estimation and source
feature extraction; EGG waveform analysis; source modeling; voice
synthesis; perceptual evaluation. Outcomes: source features that
characterize various voice qualities; EGG features for different types of
phonation; auditory effects of source features; source model for
natural-sounding voice synthesis; rules for synthesizing specific voice
qualities.]

Experimental  Data  Base 

Subjects  and  Tasks 

Three  categories  of  subjects  were  used  in  this  research:  1) 
experienced  speech  pathologists  (DMH,  GPM),  2)  normal  subjects  (CKL, 
DRW,  HBR)  who  had  no  history  of  vocal  disorders  or  laryngeal  pathology, 
and  3)  patients  with  vocal  disorders  (JMS,  EDR,  PJB),  whose  voices  were 
evaluated  by  experienced  speech  pathologists.  Except  for  the  patient 
PJB,  all  of  the  subjects  were  male.  Female  subjects  were  excluded  from 
this  study  to  avoid  a gender  factor  in  the  research. 

The  experimental  tasks  for  each  subject  included: 

(1) sustained vowels /i/ and/or /a/, recorded with an Electro-Voice RE-10
microphone,

(2) the same vowels repeated with a Bruel & Kjaer model 4133 condenser
microphone (except subject GPM),

(3) counting from one to ten at a comfortable pitch and loudness,

(4) counting from one to five with a progressive increase in
loudness,

(5) a chromatic scale on "la", and

(6) three sentences ("We were away a year ago.", "Early one
morning a man and a woman ambled along a one mile lane.",
"Should we chase those cowboys?").

For the two speech pathologists and the normal subject CKL, the same
set of tasks was repeated in various voice types (modal,
vocal fry, falsetto, breathiness, harshness and hoarseness). Table 2-1
lists  the  basic  data  about  the  subjects  and  the  sustained  vowels  they 
produced,  which  were  used  most  extensively  in  this  research. 

Data  Collection 

The  EGG  and  the  acoustic  speech  waveforms  were  simultaneously 
digitized  and  stored  on  a computer  disk  for  future  analyses.  Besides 
the  synchronized  speech  and  EGG  data,  for  some  of  the  subjects, 
especially  those  with  laryngeal  pathologies,  ultrahigh-speed  films  of 
the  vibrating  vocal  folds  were  taken  with  the  speech  and  EGG  waves 
being  simultaneously  recorded  on  the  films.  These  laryngeal  films  were 
used  to  assist  the  EGG  waveform  analysis.  The  technique  of  ultrahigh- 
speed  laryngeal  filming  was  described  in  Moore  [1975]  and  Childers  et 
al.  [1983]. 

The synchronized speech and EGG signals were recorded with the
speakers situated inside an Industrial Acoustics Company (IAC)
single-wall sound room. The speech signal was obtained with an
Electro-Voice RE-10 dynamic cardioid microphone and a Bruel & Kjaer
(B & K) model 4133 condenser microphone. The EGG signal was obtained
from a Synchrovoice electroglottograph. The microphone was held at a
fixed distance (6 inches) from the speaker's lips. The signals were
digitized by a Digital Sound Corporation (DSC) model 200 stereo A/D
and D/A system, which has 16-bit accuracy. The sampling frequency was
10 kHz, and a 5 kHz anti-aliasing filter was applied before
digitization.

Table 2-1. The data base for speech analysis.

SUBJECT  SEX  AGE  PHONATION TYPE     VOWEL  MEAN F0 (Hz)  MIC.
-------  ---  ---  -----------------  -----  ------------  ----
DMH1     M    37   modal              /i/    115           E
DMH2               modal              /a/    123           E
DMH3               modal              /i/    106           B
DMH4               modal              /a/    109           B
DMH5               falsetto           /i/    306           E
DMH6               falsetto           /a/    312           E
DMH7               falsetto           /i/    345           B
DMH8               falsetto           /a/    345           B
DMH9               slight vocal fry   /i/    90            E
DMH10              slight vocal fry   /a/    91            E
DMH11              slight vocal fry   /i/    86            B
DMH12              slight vocal fry   /a/    84            B
DMH13              severe vocal fry   /i/    45            E
DMH14              breathy (mimic)    /i/    107           E
DMH15              modal (synchro.)   /i/    126           E
DMH16              modal (synchro.)   /i/    126           B
CKL1     M    31   modal              /i/    118           E
CKL2               modal              /a/    112           E
CKL3               modal              /a/    155           B
CKL4               falsetto           /i/    263           E
CKL5               falsetto           /a/    238           E
CKL6               falsetto           /i/    213           B
CKL7               falsetto           /a/    210           B
CKL8               severe vocal fry   /i/    50            E
CKL9               severe vocal fry   /a/    48            E
CKL10              severe vocal fry   /i/    52            B
CKL11              severe vocal fry   /a/    46            B
GPM1     M    79   modal              /i/    127           E
GPM2               breathy (mimic)    /i/    111           E
HBR1     M    48   modal              /i/    142           E
HBR2               modal              /a/    141           E
DRW1     M    23   modal              /i/    129           E
DRW2               modal              /a/    128           E
JMS1     M    30   breathy (path.)    /i/    213           E
JMS2               breathy (path.)    /i/    200           B
EDR1     M    22   breathy (path.)    /i/    139           E
EDR2               breathy (path.)    /i/    134           B
PJB1     F    50   breathy (path.)    /i/    278           E
PJB2               breathy (path.)    /i/    238           B

* Microphone type (MIC.): E = Electro-Voice RE-10, B = B & K 4133.

* DMH15 and DMH16 were synchronously collected by two different types
of microphones.

* mimic = mimicked voice

* path. = pathological voice

Microphone characteristics. As mentioned above, two microphones
were used to measure the sound pressure waveforms of sustained vowels.
Of the two, the B & K 4133 condenser microphone has a very good
low-frequency response: its amplitude response is within ±2 dB down
to 10 Hz, and its phase response is essentially linear. This feature is
required when the speech signal is used for glottal wave estimation
(see Chapter 3), since the sought glottal wave has its major energy
component in a low-frequency interval (DC to 1 kHz). However, this
feature also makes the B & K 4133 condenser microphone sensitive to
low-frequency breath and ambient noise, which may cause problems in
other speech waveform analyses. Therefore, the Electro-Voice RE-10
microphone was also used. This microphone has a good frequency response
at frequencies above 50 Hz, but cuts off the low-frequency components
below 50 Hz.

EGG  instrumentation.  The  EGG  is  an  instrument  designed  to 
register  the  vocal  fold  vibration  as  a time-varying  signal.  The 
amplitude  variations  of  this  signal  are  generally  thought  to  be 
representative  of  the  amount  of  contact  between  the  vocal  folds.  An 
objective  of  the  device  is  to  provide  a measure  of  vocal  fold  activity 
decoupled  from  the  effects  of  the  supraglottal  system. 

A schematic  depiction  of  the  instrument  is  shown  in  Figure  2-2(a). 
Basically,  the  instrument  measures  the  electrical  impedance  variations 
of  the  larynx  using  a pair  of  plate  electrodes  held  in  contact  with 
the skin on both sides of the thyroid cartilage. The impedance
measurement works by supplying a radio-frequency alternating current,
which acts as a probe signal, to one of the skin electrodes. This
current is amplitude modulated by the tissue impedance as it varies
due to the vibrating vocal folds. The sensing electrode detects this
current, which is then demodulated by a detector circuit to produce
the EGG signal (Figure 2-2(b)). A comprehensive review of the EGG
instrumentation, the waveform interpretation and the applications was
given by Childers and Krishnamurthy [1985].

Figure 2-2. (a) A system configuration for electroglottography (EGG)
and (b) the EGG waveform output.

Source-Feature  Extraction 

The objective of the speech and EGG signal analysis in this
research was to extract glottal excitation features that are important
in characterizing various voice qualities. In general, the vocal fold
vibratory patterns are determined by the interplay of three factors:
1) the aerodynamic properties of the airflow which actuates the larynx,
2) the adjustment of the laryngeal muscles, and 3) the configuration
and myoelastic properties of the vocal folds. The terms that describe
a vocal fold vibratory pattern include:

(1) frequency of vibration,

(2) excursion of vibration,

(3) glottal open quotient of the vocal fold vibration,

(4) time ratio of glottal opening and closing phases (known as
speed quotient),

(5) the amount of turbulent noise due to an incomplete glottal
closure and a high airflow rate,

(6) longitudinal and vertical phase differences,

(7) the effective parts of the vocal folds participating in the
phonatory vibration,

(8) coupled vibration of the false vocal folds,

(9) regularity of vibratory pattern.
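
Items (3) and (4) in the list above can be made concrete with a small
sketch. It assumes the conventional definitions (open quotient = open
time divided by the pitch period; speed quotient = opening-phase
duration divided by closing-phase duration). The function names and the
timing values are hypothetical and are not taken from this study.

```python
def open_quotient(t_open, t_close, period):
    """Fraction of the pitch period during which the glottis is open."""
    return (t_close - t_open) / period


def speed_quotient(t_open, t_peak, t_close):
    """Opening-phase duration divided by closing-phase duration."""
    return (t_peak - t_open) / (t_close - t_peak)


# Hypothetical timing marks in milliseconds for one 8 ms pitch period:
# the glottis opens at 0 ms, is widest at 3 ms, and closes at 5 ms.
oq = open_quotient(0.0, 5.0, 8.0)
sq = speed_quotient(0.0, 3.0, 5.0)
print(oq, sq)
```

A speed quotient above 1 means the pulse opens more slowly than it
closes, i.e., it is skewed to the right.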

To extract the essential features that characterize different
voice types, three categories of analysis techniques were developed and
applied in this research. These include: 1) estimation of glottal
waves by inverse filtering techniques, 2) extraction of source-related
features directly from speech signals, and 3) EGG waveform analysis.
The analysis of pitch-period perturbation (jitter) and signal amplitude
perturbation (shimmer) has been done extensively and found to be
directly related to the voice quality of harshness [e.g., Coleman,
1960; Moore, 1962; Wendahl, 1963; Deal and Emanuel, 1978; Davis, 1979;
Horii, 1980; Heiberger and Horii, 1982; Hiller et al., 1983; Hirano et
al., 1985; Askenfelt and Hammarberg, 1986; Kasuya et al., 1986; Wolfe
and Steinfatt, 1987]. We have not repeated these studies here.

Glottal  wave  estimation.  The  airflow  from  the  glottis  is  the 
source  of  power  that  excites  the  vocal  tract  and  produces  voiced  sound. 
The  pitch,  intensity,  and  waveshape  of  the  glottal  wave  determine  the 
perceived  voice  quality.  In  this  research,  a new  inverse  filtering 
technique  was  developed  to  estimate  the  glottal  waves  from  speech 
signals.  The  adequacy  of  the  proposed  method  was  tested  and  verified 
by  using  synthetic  speech  signals.  Then,  this  inverse  filtering 
technique  was  used  to  estimate  glottal  waves  of  various  types  of 
phonation.  The  temporal  and  spectral  characteristics  of  the  estimated 
glottal  waves  were  studied. 

Extracting source features from speech signals. Signal processing
techniques were developed to extract source-related features directly
from speech signals (instead of glottal waves). Compared to the
glottal inverse filtering operation, these analysis techniques are
less restrictive and computationally less expensive. We studied
the feasibility of using these analysis techniques to extract source
features that are useful in characterizing voice quality. The source
features investigated included the spectral slope, the amount of
glottal turbulent noise, and the temporal energy distribution during an
individual pitch period.

EGG waveform features. The EGG waveform has been found to be a
valuable signal for assisting speech analysis, such as voicing and
glottal event detection [Childers et al., 1983; Krishnamurthy and
Childers, 1986]. In this research, we investigated EGG waveform
features  for  various  types  of  phonation.  Parameters  were  defined  on 
EGG  waveforms  to  predict  vocal  fold  contact  phenomena.  As  mentioned, 
in  this  study  the  EGG  and  speech  signals  were  synchronously  collected. 
This  allowed  a comparison  between  EGG  signals  and  the  corresponding 
inverse  filtered  glottal  waves.  Source  features,  such  as  the  glottal 
open  quotient  and  the  glottal  closing  characteristics,  observed  from 
these  two  different  sources  could  be  cross  verified. 

Source  Modeling  for  Speech  Synthesis 

The characteristics of a source model for speech synthesis were
found to affect the voice quality and perceptual naturalness of the
synthetic speech [Rosenberg, 1971; Holmes, 1973; Childers et al., 1987;
Childers and Wu, 1988]. Over the years, a number of voice source
models have been proposed [Rosenberg, 1971; Flanagan, 1972; Fant, 1979;
Ananthapadmanabha, 1984; Fant et al., 1985; Klatt, 1987]. However,
the limited naturalness of synthetic speech from current speech
synthesizers suggested that either something is still missing from the
voice source models, or that we do not yet know how to control them
properly. In this research, we studied the problem by investigating
the natural vocal excitation characteristics for various voice types.
Based on the analysis results, the major glottal factors that are
important in characterizing voice quality were identified. The
significance of these glottal factors for speech synthesis was
studied. Existing glottal waveform models were evaluated based on
their capability to control these factors. Then, an improved source
model was developed for producing natural-sounding synthetic speech
with a wide range of voice characteristics.

Perceptual  Evaluation 

The  final  judge  of  any  speech  synthesis  technique  is  human 
perception.  In  this  research,  we  conducted  listening  tests  to  verify 
the  adequacy  of  the  proposed  voice  source  model  and  to  investigate  the 
auditory  effects  of  selected  glottal  factors.  Using  our  new  source 
model,  synthetic  speech  samples  were  produced  by  systematically  varying 
source  parameters  that  correspond  to  specific  glottal  factors.  The  use 
of  synthetic  speech  ensured  that  a chosen  factor  was  varied  in  a 
controlled  manner,  while  other  factors  not  under  current  investigation 
did  not  vary.  The  results  of  the  listening  tests  revealed  the 
perceptual  correlates  of  the  glottal  factors,  and  provided  useful 
information  about  controlling  source  parameters  to  achieve  desired 
voice  characteristics. 


CHAPTER 3

VOICE ANALYSIS FOR SOURCE FEATURE EXTRACTION

Glottal Wave Estimation

The  airflow  at  the  glottis  is  the  source  of  power  that  excites 
the  vocal  tract  and  produces  speech.  Speech  researchers  have  long 
been  interested  in  the  nature  and  the  variations  of  the  glottal  volume 
flow [Flanagan, 1958; Fant, 1960]. But, unfortunately, due to the
physiological  structure,  the  direct  measurement  of  the  glottal  volume 
flow  is  difficult.  However,  the  speech  pressure  waveform  can  be  easily 
measured  and  represents  the  output  of  the  glottal  volume  flow  modulated 
by  the  human  vocal  tract.  Thus,  analysis  methods  are  needed  to 
decompose  the  speech  signal  into  the  vocal  tract  and  glottal  wave 
components.  Glottal  inverse  filtering  achieves  this  separation. 

According  to  the  linear  source-filter  theory  of  speech  production 
[Fant,  1960],  voiced  speech  can  be  regarded  as  a deterministic  glottal 
wave  being  input  into  a linear  system,  characterized  by  the  vocal  tract 
resonances.  Inverse  filter  analysis  is  based  on  the  assumption  that 
it  is  possible  to  devise  a filter  operating  on  the  speech  signal  that 
reverses  the  transformation  performed  by  the  vocal  tract,  revealing  the 
underlying  glottal  waveform. 

Figure 3-1 shows the block diagram representation of a linear
speech production model. The input glottal volume velocity is denoted
by u_g(n) and the output speech pressure waveform is denoted by s(n).
The vocal tract model V(z) is assumed to be an all-pole model of the
form [Atal and Hanauer, 1971]

$$ V(z) = \frac{1}{1 + \sum_{i=1}^{K} a_i z^{-i}} \tag{3-1} $$

where K is the order of the linear prediction (LP) model and a_i,
i = 1, 2, ..., K, are the model's prediction coefficients.
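
To illustrate how the coefficients of equation (3-1) shape the vocal
tract response, the following minimal sketch evaluates the gain of an
all-pole model on the unit circle. It assumes a single resonance built
from a conjugate pole pair with the usual radius-to-bandwidth relation
r = exp(-pi BW / fs); the helper names and the 650 Hz / 50 Hz / 10 kHz
values are illustrative choices, not a procedure from this study.

```python
import cmath
import math


def resonator_coeffs(f_hz, bw_hz, fs_hz):
    """Coefficients [a_1, a_2] of one conjugate pole pair (one formant)."""
    r = math.exp(-math.pi * bw_hz / fs_hz)   # pole radius from bandwidth
    theta = 2.0 * math.pi * f_hz / fs_hz     # pole angle from frequency
    return [-2.0 * r * math.cos(theta), r * r]


def all_pole_gain(a, f_hz, fs_hz):
    """|V(e^{jw})| for V(z) = 1 / (1 + sum_i a_i z^-i), a = [a_1..a_K]."""
    w = 2.0 * math.pi * f_hz / fs_hz
    denom = 1.0 + sum(a_i * cmath.exp(-1j * w * i)
                      for i, a_i in enumerate(a, start=1))
    return 1.0 / abs(denom)


fs = 10000.0
a = resonator_coeffs(650.0, 50.0, fs)
gains = {f: all_pole_gain(a, f, fs) for f in (300.0, 650.0, 1000.0)}
for f, g in gains.items():
    print(f, round(20.0 * math.log10(g), 1), "dB")
```

The gain is largest near 650 Hz, the resonance frequency set by the
pole angle.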

The speech pressure wave is related to the oral volume velocity at
the lips through a radiation impedance R(z). For frequencies below
about 4000 Hz, R(z) can be well represented by a high-pass filter
[Flanagan, 1972]

$$ R(z) \approx 1 - z^{-1} \tag{3-2} $$

Based on the linear model just described, glottal inverse
filtering is conceptually defined as solving for U_g(z) by the equation

$$ U_g(z) = \frac{S(z)}{V(z)\,R(z)} \tag{3-3} $$

The relationships going from the speech pressure waveform to the
glottal volume velocity, in the form of an analysis model, are
indicated in Figure 3-2-(a). Since R(z) is the same for different
speech sounds, the fundamental problem in the estimation of the glottal
volume velocity waveform, u_g(n), is to determine the parameters of the
inverse filter 1/V(z).

Furthermore, since all systems in the speech production model are
linear, the lip radiation and vocal tract filters can be interchanged,
leading to the arrangement of Figure 3-2-(b). By combining the lip
radiation with the glottal excitation, an effective driving function
q(n) can be defined by

$$ q(n) = u_g(n) * r(n) \tag{3-4-a} $$

or

$$ Q(z) = U_g(z)\,R(z) \tag{3-4-b} $$

where * denotes convolution. Thus, the linear model of Figure 3-1 is
equivalently described by the model in Figure 3-3.

In terms of the effective driving function q(n), the speech
production model is of the form

$$ s(n) = -\sum_{i=1}^{K} a_i\, s(n-i) + q(n) \tag{3-5} $$

where K and {a_i} are as defined in equation (3-1).
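
The chain of equations (3-2) through (3-5) can be checked numerically.
The sketch below is a toy illustration under stated assumptions: a
hypothetical stable two-pole vocal tract, a made-up glottal pulse train,
and zero initial conditions. It synthesizes s(n) with equation (3-5),
then inverse filters with the same known coefficients and integrates to
undo the lip radiation.

```python
def lip_radiation(u):
    """q(n) = u(n) - u(n-1), i.e. R(z) = 1 - z^-1 as in equation (3-2)."""
    return [u[n] - (u[n - 1] if n > 0 else 0.0) for n in range(len(u))]


def synthesize(q, a):
    """Equation (3-5): s(n) = -sum_i a_i s(n-i) + q(n)."""
    s = []
    for n, q_n in enumerate(q):
        acc = q_n
        for i, a_i in enumerate(a, start=1):
            if n - i >= 0:
                acc -= a_i * s[n - i]
        s.append(acc)
    return s


def inverse_filter(s, a):
    """Apply 1/V(z): q(n) = s(n) + sum_i a_i s(n-i)."""
    q = []
    for n, s_n in enumerate(s):
        acc = s_n
        for i, a_i in enumerate(a, start=1):
            if n - i >= 0:
                acc += a_i * s[n - i]
        q.append(acc)
    return q


def integrate(q):
    """Running sum, the inverse of the first difference 1 - z^-1."""
    out, acc = [], 0.0
    for v in q:
        acc += v
        out.append(acc)
    return out


a = [-1.2, 0.8]                                  # hypothetical stable pole pair
pulse = [0.0, 0.3, 0.8, 1.0, 0.6, 0.0, 0.0, 0.0, 0.0, 0.0]
u_g = pulse * 3                                  # toy glottal flow, three periods
s = synthesize(lip_radiation(u_g), a)            # "speech"
u_hat = integrate(inverse_filter(s, a))          # recovered glottal flow
max_error = max(abs(x - y) for x, y in zip(u_hat, u_g))
print(max_error)
```

With the true coefficients the recovery is exact up to rounding; the
practical difficulty, addressed next in the text, is estimating the
coefficients from the speech signal alone.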

Closed-Phase  Analysis 

Berouti  [1976]  proposed  a method  for  an  accurate  estimation  of  the 
vocal  tract  transfer  function  by  analyzing  speech  signals  over  the 
closed  glottal  interval.  The  principle  behind  the  closed  phase 
analysis  can  be  illustrated  as  follows. 

During the interval of glottal closure, the glottal volume
velocity u_g(n) is zero, and so is the driving function q(n). Thus,
equation (3-5) becomes

$$ s(n) = -\sum_{i=1}^{K} a_i\, s(n-i) \tag{3-6} $$

i.e., one sample after the glottal closure instant (denoted by n_c), the
speech waveform is strictly a function of the vocal tract resonances
specified by a_1, ..., a_K and the initial conditions s(n_c), ...,
s(n_c-K+1). This result holds over the entire closed glottal interval.
The filter parameters can be estimated by the covariance method of
linear prediction (LP) analysis [Markel and Gray, 1976]. However, for
this method to work the closed interval must be detected. Two methods
were compared in this study. First, we studied the two-channel
analysis technique proposed by Krishnamurthy [1983] and Krishnamurthy
and Childers [1986], in which a synchronized differential EGG (DEGG)
signal was used to locate the closed glottal phase. This technique was
called a "pseudo closed phase" method by Krishnamurthy [1983] since the
EGG-detected closed phase might not be exact but was found to provide
satisfactory accuracy. A second method, also a pseudo closed phase
method, was proposed in this study and is referred to as the "two-pass
method."

Figure 3-1. Block diagram representation of linear speech production
model. [Diagram: glottal volume velocity u_g(n), U_g(z) -> vocal tract
transfer function V(z) -> lip radiation R(z) -> speech pressure wave.]

Figure 3-2. (a) Block diagram of conceptualized glottal inverse
filtering model, (b) modified model from part (a).

Figure 3-3. Equivalent representation of the linear speech production
model shown in Figure 3-1, in terms of an effective driving function
and vocal tract transfer function. [Diagram: q(n) = u_g(n)*r(n),
Q(z) = U_g(z) R(z) -> V(z) -> s(n), S(z).]
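
The closed-phase idea behind equation (3-6) can be sketched as a small
covariance-method solver. This is a minimal illustration, not the
implementation used in this study: the function names are hypothetical,
the prediction error is minimized only over samples whose K predecessors
also lie inside the chosen interval, and the test signal is the impulse
response of a known two-pole filter, so that every sample after n = 0
belongs to a "closed phase."

```python
def solve_linear(aug):
    """Gaussian elimination with partial pivoting on an augmented matrix."""
    n = len(aug)
    m = [row[:] for row in aug]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def covariance_lp(s, start, length, order):
    """Least-squares fit of s(n) = -sum_i a_i s(n-i) using only samples
    s[start] .. s[start + length - 1]."""
    ns = range(start + order, start + length)
    phi = [[sum(s[n - i] * s[n - k] for n in ns) for k in range(order + 1)]
           for i in range(order + 1)]
    # Normal equations: sum_i a_i phi[i][k] = -phi[0][k], k = 1..order
    aug = [[phi[i][k] for i in range(1, order + 1)] + [-phi[0][k]]
           for k in range(1, order + 1)]
    return solve_linear(aug)


def impulse_response(a, n_samples):
    """Response of 1 / (1 + sum_i a_i z^-i) to a unit impulse at n = 0."""
    s = []
    for n in range(n_samples):
        acc = 1.0 if n == 0 else 0.0
        for i, a_i in enumerate(a, start=1):
            if n - i >= 0:
                acc -= a_i * s[n - i]
        s.append(acc)
    return s


true_a = [-1.2, 0.8]                      # hypothetical vocal tract filter
s = impulse_response(true_a, 30)
est_a = covariance_lp(s, start=1, length=12, order=2)
print(est_a)
```

Because no excitation falls inside the analysis interval, the estimated
coefficients match the true ones to within rounding error. An interval
of 2K samples yields exactly K error terms for K unknowns.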

Two-pass  Method 

It  is  known  that  the  LP  error  signal  (i.e.  the  driving  function, 
q(n),  defined  in  equation  (3-4-a))  for  a voiced  speech  waveform  is 
characterized  by  peaked  pulses  separated  by  the  pitch  periods  [Atal  and 
Hanauer,  1971;  Markel,  1972,  1973].  These  peaked  pulses  represent  the 
main  excitations  to  the  LP  vocal  tract  filter.  Markel  [1972,  1973] 
proposed  a pitch  estimation  method  that  employs  the  autocorrelation 
function  of  the  LP  error  signal.  This  method  is  capable  of  providing 
accurate  pitch  periods,  but  loses  all  information  about  the  absolute 
position  of  the  glottal  excitation.  We  studied  the  nature  of  the  LP 
error  signal  by  comparing  it  with  the  synchronous  EGG  signal.  Figure 
3-4  shows  an  LP  error  signal  derived  from  a fixed-frame  LP  analysis 
[Markel and Gray, 1976] and the synchronous EGG and DEGG signals. As
can be seen, the main excitation pulses in the LP error function
consistently match the negative peaks of the DEGG signal, which were
found to be very close to the instants of glottal closure [Childers et
al., 1983; Krishnamurthy, 1983]. This observation led to the
development of the "two-pass method," a method for accurate and
automatic glottal inverse filtering.

Figure 3-4. Synchronized (a) EGG signal, (b) differential EGG signal,
and (c) LP residual error function. [Annotations in the figure: the
DEGG-based closed phase and the instant of the main excitation.]

The  basic  idea  of  the  two-pass  method  is  to  identify  the  locations 
of  the  main  excitation  pulses  using  the  LP  error  signal  derived  in  the 
first  pass  of  the  inverse  filtering  procedures.  Then,  using  the  main 
excitation  pulses  as  indicators  of  glottal  closure,  a "pseudo  closed 
phase"  is  selected  as  the  analysis  interval  for  a pitch-synchronous 
covariance  LP  analysis  to  estimate  the  vocal  tract  filter,  which  in 
turn  is  used  to  obtain  the  desired  glottal  volume-velocity  waveform. 
Like  the  EGG  signal,  the  estimated  main  excitation  may  not  provide  a 
perfect  indication  for  glottal  closure.  But,  the  key  feature  of  the 
two-pass  method  is  that  it  ensures  the  exclusion  of  the  main  excitation 
pulse  from  the  analysis  interval.  This  choice  of  analysis  interval 
increases  the  accuracy  of  the  LP  analysis. 
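
The first-pass step of locating the main excitation pulses can be
sketched as follows. The text does not specify the peak-picking
procedure, so the rule used here (a relative magnitude threshold plus a
minimum spacing between accepted peaks) is an assumption, as are the
function name, its parameters, and the toy residual.

```python
def pick_excitation_peaks(residual, min_spacing, rel_threshold=0.5):
    """Indices of the dominant pulses in an LP error signal.

    A sample is a candidate if its magnitude reaches rel_threshold times
    the largest magnitude; of candidates closer together than
    min_spacing samples, only the strongest is kept.
    """
    mag = [abs(v) for v in residual]
    floor = rel_threshold * max(mag)
    peaks = []
    for n, m in enumerate(mag):
        if m < floor:
            continue
        if peaks and n - peaks[-1] < min_spacing:
            if m > mag[peaks[-1]]:
                peaks[-1] = n        # stronger pulse within the same period
        else:
            peaks.append(n)
    return peaks


# Toy residual: low-level background with one main pulse per 80-sample
# period (125 Hz at a 10 kHz sampling rate) and one secondary pulse.
residual = [0.05] * 200
residual[5], residual[6] = -1.0, 0.6
residual[85] = -0.9
residual[165] = -1.1
peaks = pick_excitation_peaks(residual, min_spacing=40)
periods = [b - a for a, b in zip(peaks, peaks[1:])]
print(peaks, periods)
```

Unlike an autocorrelation-based pitch estimate, this keeps the absolute
positions of the pulses, which is what the second pass needs.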

A block diagram of the two-pass method is shown in Figure 3-5.
In the first-pass procedure, a pitch-asynchronous (fixed frame) LP
analysis was performed on the input speech signal s(n). The estimated
LP filter, V_1(z), was used to derive the corresponding LP error signal,
q_1(n), by an inverse filtering operation. As mentioned above, for a
voiced speech signal, the LP error function is characterized by a pulse
train with the appropriate pitch period. The locations of these
excitation pulses were detected by a peak-picking method and were used
as indicators of glottal closure. In the second-pass procedure, a
pitch-synchronous covariance LP analysis was used to estimate another,
improved LP filter, V_2(z). For each pitch period, the analysis
interval started one sample after the instant of the excitation pulse.
The duration of the interval was 35% of the entire pitch period,
provided it was at least twice the order of the LP filter (the minimum
data length required for covariance LP analysis). The formant
resonances of the vocal tract were estimated by solving for the roots
of the LP polynomial and then shaping the formant structure by
empirical rules, which included (1) discarding the roots with center
frequencies under 250 Hz, (2) discarding the roots with bandwidths
greater than 500 Hz, and (3) merging two adjacent roots. The refined
formant resonances were then used to construct the vocal tract transfer
function, which was used in the final glottal inverse filtering
procedure. The direct output of the glottal inverse filtering operation
is a differential glottal volume-velocity u'_g(n) (i.e., the equivalent
driving function to the vocal tract filter), which represents the
combined effect of the lip radiation and the glottal volume-velocity.
A glottal volume-velocity waveform, u_g(n), is derived by carrying out
an integration to cancel out the effect of the lip radiation.

Figure 3-5. Block diagram of the two-pass method for glottal inverse
filtering analysis.
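
Two of the second-pass rules described above lend themselves to a short
sketch: the choice of the analysis interval and the pruning of LP roots
by center frequency and bandwidth. The function names are hypothetical.
The roots are assumed to have been computed already by a polynomial root
finder. Returning None when the interval is shorter than twice the LP
order is an assumption, since the text does not say what was done in
that case. Rule (3), merging adjacent roots, is omitted because its
details are not given.

```python
import cmath
import math


def analysis_interval(pulse_index, pitch_period, order, fraction=0.35):
    """(start, length) of the pseudo closed phase, or None if too short."""
    start = pulse_index + 1                   # one sample after the pulse
    length = round(fraction * pitch_period)   # 35% of the pitch period
    if length < 2 * order:                    # covariance LP needs >= 2K samples
        return None
    return start, length


def pole_to_formant(z, fs_hz):
    """Center frequency and bandwidth (Hz) of a z-plane pole."""
    freq = abs(cmath.phase(z)) * fs_hz / (2.0 * math.pi)
    bw = -math.log(abs(z)) * fs_hz / math.pi
    return freq, bw


def prune_formants(poles, fs_hz, min_freq=250.0, max_bw=500.0):
    """Apply rules (1) and (2): drop low-frequency and wide-band roots."""
    kept = []
    for z in poles:
        if z.imag <= 0.0:        # keep one pole per conjugate pair
            continue
        freq, bw = pole_to_formant(z, fs_hz)
        if freq >= min_freq and bw <= max_bw:
            kept.append((freq, bw))
    return sorted(kept)


def make_pole(f_hz, bw_hz, fs_hz):
    """Helper for the demo: pole with a given frequency and bandwidth."""
    return cmath.rect(math.exp(-math.pi * bw_hz / fs_hz),
                      2.0 * math.pi * f_hz / fs_hz)


fs = 10000.0
poles = []
for f, bw in [(650.0, 50.0), (100.0, 60.0), (1100.0, 800.0)]:
    p = make_pole(f, bw, fs)
    poles += [p, p.conjugate()]
kept = prune_formants(poles, fs)
print(analysis_interval(5, 80, 12), analysis_interval(5, 40, 12), kept)
```

In this demo only the 650 Hz root survives: the 100 Hz root falls under
the 250 Hz limit and the 1100 Hz root exceeds the 500 Hz bandwidth
limit. With a 40-sample pitch period, 35% gives 14 samples, fewer than
the 24 needed for a 12th-order analysis.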

The  validity  of  the  two-pass  method  was  verified  by  testing  it 
with  synthetic  speech  signals.  The  synthetic  speech  signals  were 
produced  by  a cascade  formant  synthesizer  [Klatt,  1980]  excited  by 
stylized  glottal  pulses  generated  by  the  LF  model  [Fant  et  al.,  1985]. 
Two sets of formant resonances (Table 3-1) were used, with typical
formant frequencies and bandwidths of the vowels /a/ and /i/,
respectively. Three excitation pulses were used, with the first (EXC 1)
having a typical glottal waveshape of normal male phonations (a closed
phase of 35% of the pitch period and a medium pulse skewness), the
second (EXC 2) having a long closed phase and a high degree of pulse
skewness, and the third (EXC 3) having no closed phase and a low degree
of pulse skewness. These two sets of formant resonances and three
excitation pulses were used to produce six vowel samples. For each
vowel sample, the two-pass inverse filtering method was tested with
various LP filter orders and with or without a preemphasis filter,
1 - z^{-1}, before the covariance LP analysis.

Table  3-1  summarizes  the  testing  results.  These  results  revealed 
that,  when  appropriate  analysis  parameters  were  used,  the  two-pass 
method  is  capable  of  estimating  glottal  waves  with  high  accuracy. 
Our  data  also  showed  that,  in  all  the  tests,  the  locations  of  the 
excitation  pulses  were  successfully  detected  in  the  first-pass 
procedure.  A typical  example  is  presented  in  Figure  3-6,  which  shows 
the  original  excitation  wave,  the  synthetic  speech  waveform,  the  LP 
error  function  derived  in  the  first-pass  procedure,  and  the  estimated 
excitation  wave  after  the  second-pass  inverse  filtering. 

The  analysis  parameters  that  affect  the  performance  of  the  glottal 
inverse  filtering  included  the  LP  filter  order  (NLP)  and  the  use  of  a 
preemphasis  filter  (PRE).  Our  experimental  results  showed  that,  for 
a voiced  speech  signal  with  five  formant  resonances,  the  acceptable 
minimum  NLP  was  12.  Thus,  in  addition  to  the  10  poles  required  for 
the  simulation  of  the  formant  resonances,  at  least  2 extra  poles  were 
needed  to  account  for  the  effect  of  the  source  excitation.  The  optimum 
NLP,  which  was  unknown  before  the  inverse  filtering  analysis,  was 
dependent  on  the  characteristics  of  the  actual  excitation  wave.  Two 
extra  poles  were  found  to  be  enough  to  account  for  an  excitation  wave 
with  a spectral  roll-off  rate  not  higher  than  12  dB/octave;  higher  NLPs 
did not necessarily improve the performance. But, for excitation waves
with higher spectral roll-off rates, additional poles were required.
Our data also revealed that, in general, the greatest estimation error
appeared in the first formant. This result implied that an all-pole
model can better approximate the high-frequency roll-off trend than the
low-frequency characteristics of the glottal waves.

Table 3-1. The analysis parameters and the estimated formant resonances
using the two-pass inverse filtering method.

Formant Resonance (Hz)

                  F1   BW1    F2   BW2    F3   BW3    F4   BW4    F5   BW5   NLP  PRE
Original /a/     650    50  1100    80  2650   120  3420   200  4800   300
Test 1-1         645   103  1099    87  2650   142  3426   216  4806   318    12  No
Test 1-2         648    73  1098    70  2650   131  3422   203  4799   295    12  Yes
Test 1-3         651    93  1102    72  2647   132  3421   212  4799   285    14  No
Test 1-4         653    72  1094    72  2651   117  3411   207  4800   270    14  Yes
Test 2-1         642    93  1095    84  2653   131  3424   202  4798   290    12  No
Test 2-2         661    79  1101    66  2648   139  3426   214  4797   271    12  Yes
Test 2-3         640   106  1100    92  2646   127  3419   216  4799   285    14  No
Test 2-4         653    72  1094    72  2651   117  3411   207  4800   270    14  Yes
Test 3-1         920   423     *     *     *     *     *     *     *     *    12  No
Test 3-2         642   140  1191    87  2662   124  3425   174  4778   252    12  Yes
Test 3-3         727   408  1145   528     *     *     *     *     *     *    14  No
Test 3-4         645   148  1093    90  2661   128  3427   181  4778   244    14  Yes
Original /i/     350    40  2100    80  2600   150  3300   200  3800   250
Test 4-1         338    79  2100    74  2604   147  3304   194  3806   244    12  No
Test 4-2         354    71  2099    76  2602   150  3303   199  3803   250    12  Yes
Test 4-3         336    89  2098    72  2598   148  3300   203  3802   253    14  No
Test 4-4         352    67  2100    71  2599   142  3297   202  3799   257    14  Yes
Test 5-1         349    57  2102    69  2605   144  3305   189  3807   242    12  No
Test 5-2         369    67  2099    75  2601   156  3306   201  3809   254    12  Yes
Test 5-3         345    61  2098    67  2597   147  3300   206  3802   258    14  No
Test 5-4         366    59  2101    66  2599   136  3293   206  3796   270    14  Yes
Test 6-1         244   274  2170   643     *     *     *     *     *     *    12  No
Test 6-2         345    89  2104    81  2623   193  3340   219  3846   382    12  Yes
Test 6-3         278   203  2055   407  2720   635  3503   693     *     *    14  No
Test 6-4         346    95  2108    79  2626   154  3317   179  3794   304    14  Yes

* denotes failure of estimation.

Figure 3-6. A test example for the two-pass glottal inverse filtering
method. (a) Speech waveform; (c) original excitation wave (lower trace)
and its integration (upper trace); (d) estimated excitation wave (lower
trace) and its integration (upper trace).

Preemphasis  in  the  speech  literature  refers  to  a simple  high-pass 
filtering  of  the  speech  signal.  Theoretically,  a true  closed  phase 
analysis  does  not  require  preemphasis  (equation  (3-6)).  However,  in 
practical  LP  analyses,  the  use  of  a preemphasis  filter  often  reduces 
the  ill  conditioning  of  the  computation  [Markel  and  Gray,  1976].  Our 
experiments  showed  mixed  results.  For  original  excitation  pulses  with 
relatively  long  closed  phases  (EXC  1 and  EXC  2),  the  use  of  preemphasis 
might  slightly  improve  or  degrade  the  accuracy  of  the  formant 
estimation.  The  differences  were  not  significant.  On  the  other  hand, 
for  original  excitation  pulses  with  a short  or  even  no  closed  phase 
(e.g., EXC 3), preemphasis became very important. For example,
without preemphasis, we failed to estimate the formant resonances from
the synthetic speech signals excited by EXC 3 (tests 3-1, 3-3, 6-1, and
6-3). But, once a preemphasis filter was used, dramatic improvements
were achieved (tests 3-2, 3-4, 6-2, and 6-4). For these cases, the
high-pass characteristic of preemphasis made the speech signals
better fit the all-pole model. Figure 3-7-(a) shows the speech
spectrum of a synthetic vowel /a/ excited by EXC 3 (i.e., the speech
signal used in tests 3-1 and 3-3). The sinusoid-like excitation wave,
EXC 3, leads to an overwhelming spectral peak in the low-frequency
region. Figure 3-7-(b) shows the preemphasized version of Figure
3-7-(a), in which the formant resonances were boosted by the high-pass
preemphasis.
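
The size of this boost follows from the magnitude response of the
preemphasis filter 1 - z^{-1}. As a worked note (standard filter
algebra rather than a result of this study), with f_s denoting the
sampling frequency:

```latex
\left| 1 - e^{-j\omega} \right| = 2 \left| \sin\!\left( \frac{\omega}{2} \right) \right|,
\qquad \omega = \frac{2\pi f}{f_s}
```

For small ω the magnitude is approximately ω, so the gain rises by
about 6 dB per octave. Components near the fundamental are therefore
attenuated strongly relative to the formant region, which reduces the
dominance of the low-frequency peak.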

Figure 3-7. The speech spectra of a synthetic vowel /a/ excited by the
excitation wave EXC 3: (a) without preemphasis, (b) with preemphasis.

In summary, the proposed two-pass method is capable of estimating
glottal  waveforms  with  high  accuracy,  provided  appropriate  analysis 
parameters  are  used.  Figure  3-8  shows  the  three  original  excitation 
waves  and  their  counterparts  estimated  by  the  two-pass  method.  As  can 
be  seen,  even  the  excitation  wave  with  no  closed  phase  (EXC  3)  was 
recovered  quite  well  by  the  two-pass  method  (Figure  3-8-(h)  and  (i)). 

Aside  from  the  inverse  filtering  method,  to  recover  a natural 
glottal  wave  accurately  the  original  speech  signal  has  to  be  collected 
without  magnitude  and  phase  distortions  throughout  the  frequencies  of 
interest (DC to 1 kHz). In this study, we simultaneously collected
speech signals with two types of microphones (B&K 4133 and RE-10, see
Chapter 2) and compared the outcomes of the inverse filtering analyses.
The results of this experiment are presented in Appendix A; they
revealed the crucial importance of having undistorted speech signals
for estimating glottal waves by inverse filtering. Thus, for our
formal  inverse  filtering  analysis,  only  the  speech  signals  measured  by 
a condenser  microphone  (B&K  4133)  with  a good  low-frequency  response 
followed  by  a direct  digital  conversion  were  used. 

The  two-pass  method  was  performed  on  our  natural  speech  samples 
of  various  voice  types  to  estimate  the  underlying  glottal  waves.  The 
results  showed  that  there  is  no  significant  difference  between  the 
glottal  waves  estimated  by  the  two-pass  method  and  the  two-channel 
method  [Krishnamurthy  and  Childers,  1986].  In  fact,  the  analysis 
intervals  selected  by  these  two  methods  were  close  to  each  other  (e.g., 
Figure  3-4).  Thus,  the  two-pass  method  achieved  the  accuracy  of  the 
two-channel  closed  phase  analysis,  with  no  extra  auxiliary  signal  (EGG) 
being required to assist the detection of the glottal closure
intervals.

Figure 3-8. The original and estimated excitation waves using the
two-pass inverse filtering method: (a) original excitation wave (EXC 1),
(b) estimated excitation wave (test 1-3), (c) estimated excitation wave
(test 4-4), (d) original excitation wave (EXC 2), (e) estimated
excitation wave (test 2-4), (f) estimated excitation wave (test 5-1),
(g) original excitation wave (EXC 3), (h) estimated excitation wave
(test 3-4), (i) estimated excitation wave (test 6-4).

Glottal Waveform Characteristics

In  general,  a complete  vocal  fold  vibratory  cycle  during  phonation 
includes  an  open  phase  and  a closed  phase,  and  the  open  phase  can  be 
further  divided  into  an  opening  phase  and  a closing  phase.  In  this 
section,  we  investigate  the  common  and  distinctive  glottal  waveform 
characteristics  for  various  voice  types  in  terms  of  these  glottal 
phases.

Figure  3-9  shows  the  inverse  filtered  effective  excitation  waves 
(i.e.,  the  differential  glottal  volume-velocity  waveforms,  Ug'(t)) 
of  various  voice  types.  Some  high-frequency  "noise"  appears  on  these 
waves,  presumably  due  to  an  incomplete  cancellation  of  the  vocal 
tract  transfer  function.  Nevertheless,  the  integrations  show  quite 
reasonable  glottal  volume-velocity  waveforms,  Ug(t)  (also  shown  in 
Figure 3-9). Close observation of these glottal volume-velocity
waveforms revealed that, for all the voice types under investigation
(i.e., modal, vocal fry, falsetto, and breathy voices), the closing
phase exhibits a steeper change of slope than the opening phase. Thus,
over each pitch period, Ug'(t) shows a sharp excitation pulse at
the instant of maximum closing slope. The distinctive features that
distinguish  different  voice  types  are  the  sharpness  of  the  excitation 
pulse  and  the  location  at  which  the  maximum  closing  slope  occurs.  For 
modal  and  vocal  fry  phonations  (Figure  3-9-(a)  to  (d)),  the  maximum 
closing  slope  occurs  at  an  instant  near  the  glottal  closure  and  results 
in  an  abrupt  termination  of  the  glottal  airflow.  On  the  other  hand, 
for  falsetto  and  breathy  phonations  (Figure  3-9-(e)  to  (h)),  the 
maximum  closing  slope  occurs  around  the  middle  of  the  closing  branch 
followed  by  a residual  phase  of  progressive  closure.  We  also  found 
progressive glottal closure during the offset of a normal voiced sound,
where the vocal effort is decreasing. Figure 3-10 shows the ending
part of an inverse filtered differential glottal wave of a vowel
sample. This example depicts decreasing sharpness of excitation pulses
with decreasing vocal effort.

Figure 3-9. The inverse filtered differential glottal waves (lower) and
their integrations (upper): (a) modal voice (DMH16), (b) modal voice
(DMH3), (c) modal voice (CKL3), (d) vocal fry (CKL11), (e) falsetto
(CKL7), (f) falsetto (DMH8), (g) breathy voice (EDR2), (h) breathy
voice (JMS2).

In  addition  to  the  abruptness  of  glottal  closure,  the  glottal 
waves  of  different  voice  types  were  also  found  to  be  characterized  by 
different  glottal  pulse  widths  and  different  glottal  pulse  skewnesses 
(i.e., the time ratio of the opening phase to the closing phase).
Regarding the glottal pulse width, our samples showed that the
modal phonations have medium values around 65% of the pitch periods,
the  vocal  fry  phonations  have  small  values  around  25%,  while  the 
falsetto  and  breathy  phonations  have  quite  high  percentages,  which 
sometimes  make  the  existence  of  the  closed  phase  uncertain.  As  for  the 
glottal  pulse  skewness,  the  voice  types,  ranking  in  order  of  decreasing 
skewness,  were  vocal  fry,  modal,  falsetto  and  breathy  voices. 

Glottal  Spectral  Characteristics 

It  is  instructive  to  study  the  spectra  of  glottal  waves  since  the 
auditory  impression  is  closely  related  to  spectral  features  [Flanagan, 
1972].  In  this  study,  we  investigated  the  spectral  characteristics  of 
the  glottal  source  of  various  voice  types.  Our  observations  suggested 
that  the  glottal  waves  of  different  voice  types  can  be  distinguished  by 
two  aspects:  (1)  the  general  spectral  trend,  and  (2)  the  intensity 
relations  between  the  fundamental  frequency  and  the  higher  harmonics. 

Spectral  slope.  As  pointed  out  by  Monsen  and  Engebretson  [1977], 
the glottal spectrum can rarely be described in terms of a spectral
slope with a constant dB per octave. Generally, the envelope falls
off in an irregular manner. In this study, we represent the general
spectral trend of a glottal volume flow as the transfer function of a
low-pass filter with multiple real poles. The method is described
below.

Figure 3-10. The differential glottal wave at the end of a normal
voicing.

For  normal  phonations,  Flanagan  [1958]  gave  an  average  value  of 
-12 dB/octave for the slope of the glottal spectra. This led to the
commonly  accepted  two-pole  model  for  approximating  the  general  spectral 
trend  of  a normal  glottal  volume  flow: 


$$U_g(z) = \frac{K}{(1 - z_a z^{-1})(1 - z_b z^{-1})} \qquad (3\text{-}7)$$

where K is a constant related to the amplitude of the glottal flow and
za, zb are real poles inside the unit circle.

Figure 3-11 shows the Fourier spectra and the corresponding
two-pole approximations for the glottal waves of various voice types.
The results showed that the two-pole model is appropriate for modal
phonations (Figure 3-11-(a) and (b)). This model also simulated well
the spectral trend of a low-pitched vocal fry (Figure 3-11-(c)), except
for a small low-frequency interval (below the third harmonic).
However, the two-pole model apparently was not adequate for falsetto
and breathy phonations, which have spectral roll-off rates considerably
higher than 12 dB/octave (Figure 3-11-(d) to (f)). Thus, for these two
types of phonation, a three-pole model is required, i.e.,


$$U_g(z) = \frac{K}{(1 - z_a z^{-1})(1 - z_b z^{-1})(1 - z_c z^{-1})} \qquad (3\text{-}8)$$

where zc is the third real pole inside the unit circle.

To solve equation (3-8), an ordinary linear prediction (LP) method
cannot be directly applied since the poles are restricted to have only
real values. We used a method with adaptive preemphasis filters and
first-order LP analyses. Figure 3-12 shows the block diagram. The
input glottal wave was first preemphasized by a high-pass filter of the
form 1 - z^-1 (i.e., set za = 1). Then, a first-order LP analysis was
performed on the preemphasized glottal wave to derive the prediction
coefficient zx. If zx was less than .97, we set zb = zx and terminated
the procedure. Otherwise (zx > .97), the preemphasized glottal wave
was once again high-pass filtered by 1 - z^-1 (i.e., set zb = 1) before
it went through a first-order LP analysis to derive the prediction
coefficient zc.

Figure 3-11. Glottal spectra and the corresponding two-pole model
approximations for various voice types.
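The procedure of Figure 3-12 can be sketched in a few lines of Python. This is only an illustrative sketch, not the original implementation: the function names are hypothetical, and the first-order LP coefficient is assumed to be computed by the autocorrelation method (the ratio r[1]/r[0]), which the text does not specify.

```python
def first_order_lp(x):
    """First-order LP coefficient (autocorrelation method): r[1] / r[0]."""
    r0 = sum(v * v for v in x)
    r1 = sum(a * b for a, b in zip(x[1:], x[:-1]))
    return r1 / r0 if r0 > 0.0 else 0.0


def preemphasize(x):
    """Apply the high-pass filter 1 - z^-1 (first difference)."""
    return [x[0]] + [x[n] - x[n - 1] for n in range(1, len(x))]


def estimate_real_poles(glottal_wave, threshold=0.97):
    """Return (zb, zc); zc is None when two poles (za = 1, zb) suffice."""
    y = preemphasize(glottal_wave)      # accounts for za = 1
    zx = first_order_lp(y)
    if zx < threshold:
        return zx, None                 # two-pole model: zb = zx
    y = preemphasize(y)                 # second preemphasis: zb = 1
    return 1.0, first_order_lp(y)       # three-pole model: zc from LP
```

Applied to the impulse response of a filter with real poles at 1 and 0.8, the sketch returns zb close to 0.8 and no third pole; with poles at 1, 1, and 0.5 it takes the second branch and returns zb = 1 with zc close to 0.5.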

Table 3-2 lists the values of zb and zc for the inverse filtered
glottal  waves  of  various  voice  types.  The  results  confirmed  that  the 
general  spectral  trend  of  a modal  or  vocal  fry  phonation  can  be  modeled 
by  two  real  poles.  On  the  other  hand,  to  simulate  the  spectral  trend 
of  a falsetto  or  breathy  phonation,  one  extra  real  pole  is  required  to 
account  for  its  steeper  roll-off  rate.  Figure  3-13  shows  the  improved 
spectral  matching  using  the  three-pole  model  for  the  falsetto  and 
breathy  phonations. 

We noticed that the multi-pole model just described approximates the
high-frequency spectral trend of a glottal wave better than its
low-frequency characteristics. Since the glottal wave must be of finite
duration, an exact model should be a finite impulse response filter,
and hence contain only zeros. Our results showed that the use of an
all-pole model is more likely to cause mismatch at the low-frequency
interval (see Figures 3-11 and 3-13). This explains why higher errors
appeared in the first formant estimates in our inverse filtering
analysis.

Figure 3-12. Block diagram for computing real poles for the low-pass
filter model of glottal waves.

Table 3-2. The estimated real poles for the low-pass filter model
of the glottal waves.

Subject  Voice Type        F0 (Hz)  zb     zc
CKL11    vocal fry          46       .83   -
DMH13    vocal fry          45       .81   -
DMH9     slight vocal fry   90       .83   -
CKL3     modal             155       .92   -
DMH3     modal             106       .92   -
DMH16    modal             126       .96   -
CKL7     falsetto          210      1.00   .60
DMH8     falsetto          312      1.00   .73
EDR2     breathiness       137      1.00   .25
JMS2     breathiness       200      1.00   .87


Figure 3-13. Improved glottal spectral approximations using the
three-pole model for falsetto and breathy voices: (a) falsetto (DMH8),
(b) breathy voice (EDR2), (c) breathy voice (JMS2).

Harmonic relations. According to Holmes [1973], the harmonics at
low  frequencies  (below  the  first  formant)  are  of  great  importance  for 
perception,  presumably  due  to  their  high  energy  ratios.  In  this  study, 
we  investigated  the  harmonic  relations  for  various  voice  types.  In 
Figure  3-14,  we  rearranged  the  glottal  spectra  of  various  voice  types 
shown  in  Figure  3-11  by  displaying  only  the  intensities  of  the  first  10 
harmonics.  As  can  be  seen,  aside  from  the  differences  in  the  spectral 
roll-off  rates  discussed  above,  the  glottal  spectra  of  different  voice 
types  showed  distinctive  intensity  relations  between  the  fundamental 
and  the  higher  harmonics.  We  defined  a parameter,  called  "harmonic 
richness  factor"  (HRF)  to  measure  this  intensity  ratio: 

$$\mathrm{HRF} = \frac{\sum_{i \ge 2} H_i}{H_1} \qquad (3\text{-}9)$$

where H1 is the intensity of the fundamental frequency and Hi is the
intensity of the ith harmonic.

The  HRF  values  for  various  voice  types  are  listed  in  Table  3-3. 
The  results  showed  that  the  vocal  fry  glottal  waves  have  the  highest 
harmonic  energy  ratios,  while  the  falsetto  and  breathy  glottal  waves 
are characterized by a high-intensity fundamental. It is interesting
to note that these differences in the harmonic relations are of little
significance for speech intelligibility (phonetic identifiability), but
certainly affect the perceived voice quality.
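As a small illustration, equation (3-9) can be evaluated from a list of harmonic intensities. The sketch below is not the original code; the function name is hypothetical, and it assumes the intensities are linear power values so that the ratio is expressed in dB as 10 log10, a conversion the text does not state explicitly.

```python
import math


def harmonic_richness_factor_db(harmonic_intensities):
    """HRF of equation (3-9) in dB; item 0 is H1 (the fundamental)."""
    h1 = harmonic_intensities[0]
    higher = sum(harmonic_intensities[1:])   # H2 + H3 + ...
    return 10.0 * math.log10(higher / h1)
```

A spectrum dominated by the fundamental gives a strongly negative value, as for the falsetto and breathy samples, while a spectrum whose higher harmonics together exceed the fundamental gives a positive value, as for the vocal fry samples.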

Figure 3-14. The intensity spectrum (first ten harmonics) of glottal
waves of various voice types. Axes: harmonic number (1 to 10) versus
intensity (dB).

Table 3-3. The harmonic richness factors (HRF) for glottal waves
of various voice types.

Subject  Voice Type        F0 (Hz)  HRF (dB)
CKL11    vocal fry          46         5.2
DMH13    vocal fry          45         5.3
DMH9     slight vocal fry   90        -4.1
CKL3     modal             155        -9.4
DMH3     modal             106        -9.1
DMH16    modal             126       -11.3
CKL7     falsetto          210       -18.7
DMH8     falsetto          312       -19.5
EDR2     breathiness       137       -15.2
JMS2     breathiness       200       -18.3

Source-Feature Extraction from Speech Signals

Though a glottal volume flow can be estimated by inverse filtering
the speech pressure wave, as described in the previous section, the
inverse filtering analysis is computationally expensive and the result
is vulnerable to modeling and computational errors. In this research,
we developed techniques that can extract source-related features
directly from speech signals. Parameters were defined to measure the
general spectral tilt, the amount of turbulent noise, and the temporal
energy distribution over an individual pitch period.

General  Spectral  Tilt 

In  the  previous  section,  we  showed  that  the  spectral  slope  of  a 
glottal  volume  flow  is  a distinctive  feature  for  different  voice  types. 
Here,  we  present  a simple  method  to  estimate  this  glottal  feature 
directly  from  a speech  signal.  Estimation  of  the  glottal  wave  is  not 
required. 

It  is  known  that  the  general  spectral  trend  of  a voiced  speech 
signal  is  determined  by  the  combined  contribution  of  the  glottal  pulse 
and  the  lip  radiation.  Using  equations  (3-2)  and  (3-8),  it  can  be 
approximated by (assuming za = 1)

$$U_g(z) \cdot R(z) = \frac{K}{(1 - z_b z^{-1})(1 - z_c z^{-1})} \qquad (3\text{-}10)$$

i.e.,  the  general  spectral  trend  of  a speech  signal  is  characterized 
by  zb  and  zc,  two  real  poles  inside  the  unit  circle.  To  estimate  the 
pole  values  in  equation  (3-10),  the  method  of  first-order  LP  analysis 
with  preemphasis  filters  as  shown  in  Figure  3-12  can  be  used.  Here  the 
speech  signal  replaces  the  preemphasized  glottal  wave. 

Table 3-4 lists the pole values of zb and zc for vowel samples of
various voice types. These results are roughly consistent with those
listed in Table 3-2, i.e., the results derived from the glottal waves.
Figure 3-15-(a) to (d) show the typical speech spectra of modal, vocal
fry, falsetto and breathy voices, respectively. As can be seen, the
spectra of falsetto and breathy voices, which have high values for zb
and zc, are characterized by steep-falling slopes, while the vocal fry
spectrum, which is associated with a small zb, has a much flatter
slope.

Table 3-4. The estimated real poles for representing the general
speech spectral trend.

Subject  Voice Type        F0 (Hz)  Vowel  zb     zc
CKL8     severe vocal fry   50      /i/     .73   -
CKL9     severe vocal fry   48      /a/     .85   -
DMH13    severe vocal fry   45      /i/     .60   -
DMH9     slight vocal fry   90      /i/     .63   -
DMH10    slight vocal fry   91      /a/     .85   -
CKL1     modal             118      /i/     .80   -
CKL2     modal             112      /a/     .88   -
DMH1     modal             115      /i/     .83   -
DMH2     modal             123      /a/     .87   -
DMH3     modal             106      /i/     .86   -
DMH16    modal             126      /a/     .92   -
HBR1     modal             142      /i/     .83   -
HBR2     modal             141      /a/     .83   -
DRW1     modal             129      /i/     .79   -
DRW2     modal             128      /a/     .88   -
CKL4     falsetto          263      /i/    1.00   .68
CKL5     falsetto          238      /a/    1.00   .86
CKL6     falsetto          213      /i/    1.00   .60
CKL7     falsetto          210      /a/    1.00   .86
DMH5     falsetto          306      /i/    1.00   .38
DMH6     falsetto          312      /a/    1.00   .85
DMH7     falsetto          345      /i/    1.00   .76
DMH8     falsetto          345      /a/    1.00   .82
GPM2     breathy (mimic)   108      /i/    1.00   .07
DMH14    breathy (mimic)   111      /i/    1.00   .06
JMS1     breathy (path.)   212      /i/    1.00   .58
JMS2     breathy (path.)   200      /i/    1.00   .72
EDR1     breathy (path.)   139      /i/    1.00   .14
EDR2     breathy (path.)   134      /i/    1.00   .34
PJB1     breathy (path.)   270      /i/    1.00   .36

Figure 3-15. Speech spectra of various voice types: (a) modal voice,
(b) vocal fry (DMH12), (c) falsetto (DMH5), (d) breathy voice (DMH14).
Horizontal axis: frequency (Hz).

Our data also showed that the pole values estimated directly from
speech signals are affected by the formant resonances of the speech
signals. For the same speaker phonating in the same voice type, the
pole values estimated from samples of vowel /i/ are usually slightly
smaller than those estimated from samples of vowel /a/. Nevertheless,
these estimated pole values still provide good indications of the
general spectral trends of the glottal waves. Information on the
glottal low-pass characteristics is not only useful in distinguishing
different voice types; it can also be used to determine the adaptive
preemphasis coefficients in many speech processing applications where
a normalization of speech spectral flatness is desired [Markel and
Gray, 1976].

Frokjaer-Jensen and Prytz [1976] defined a parameter, α, to
measure the intensity ratio between the higher and lower frequency
regions of a speech signal:

$$\alpha = \frac{\text{intensity above 1000 Hz}}{\text{intensity below 1000 Hz}} \qquad (3\text{-}11)$$

Based on their experimental results, the authors reported that the
parameter α seems to be a good acoustic correlate of the physiological
term "vocal fold medial compression," i.e., a voice with low vocal
effort has a low α-value, while a voice with high vocal effort has a
high α-value.

The parameter α can also be applied in the present study to
measure the energy distribution for speech samples of various voice
types. However, this parameter has a major disadvantage: its value is
greatly affected by the specific formant structure of the particular
vowel sample. To overcome this problem, Frokjaer-Jensen and Prytz
[1976] computed α-values for long-time-average spectra (LTAS). This
approach requires considerable computation.
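For reference, the ratio of equation (3-11) can be computed from any intensity spectrum sampled on uniformly spaced frequency bins. The sketch below is illustrative only: the function name and arguments are hypothetical, and the bin lying exactly on the 1000-Hz boundary is assumed to belong to neither band.

```python
def alpha_ratio(intensity, bin_hz, split_hz=1000.0):
    """Equation (3-11): intensity above split_hz over intensity below it."""
    below = sum(v for k, v in enumerate(intensity) if k * bin_hz < split_hz)
    above = sum(v for k, v in enumerate(intensity) if k * bin_hz > split_hz)
    return above / below
```

Because the sums run over the whole spectrum, a strong formant on either side of the boundary shifts the ratio directly, which is the vowel dependence noted above.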

Interharmonic  Noise 

Turbulent  noise  at  the  glottis  was  found  to  contribute  to  the 
perceptual  quality  of  breathiness  [Isshiki  et  al.,  1978;  Yumoto  and 
Gould,  1982;  Hillman  et  al.,  1983;  Hiraoka  et  al.,  1984].  To  estimate 
the  amount  of  the  turbulent  noise  contained  in  a speech  signal,  both 
time-domain  and  frequency-domain  methods  have  been  proposed.  Yumoto 
and  Gould  [1982]  measured  the  differences  between  the  individual 
periods  and  the  average  speech  waveform.  This  method  is  conceptually 
straightforward.  But,  for  an  accurate  estimation,  the  speech  signal 
must  be  sampled  at  a very  high  sampling  rate  (Yumoto  and  Gould  [1982] 
used  a sampling  rate  of  20  kHz).  Besides,  extreme  care  must  be  taken 
to  time-align  the  data  when  performing  this  waveform  subtraction 
method.  Hiraoka  et  al.  [1984]  tackled  the  problem  in  the  frequency 
domain.  They  performed  a fast  Fourier  transform  (FFT)  on  a segment  of 
speech  samples  (about  0.2  seconds),  separated  the  speech  energy  into 
harmonic  and  interharmonic  noise  components,  and  computed  the  relative 
harmonic  intensity. 

In this study, we adapted the frequency-domain method proposed by
Hiraoka et al. [1984], and incorporated a procedure to estimate the
fundamental frequency (F0) with high accuracy, which is crucial for
identifying the higher harmonic components. Figure 3-16 shows the
block diagram of our harmonic-versus-noise analysis. The intensity
spectrum of a sustained vowel signal was obtained by FFT of 2048 sample
points (204.8 msec); a Hamming window was used. To identify the
locations of the harmonic components, an accurate estimate of F0 is
required. This was done by a two-step procedure. First, we estimated
the pitch periods of the speech signal by using the synchronous EGG
signal (see next section). Since the pitch periods were estimated in
number of sample points, the accuracy of this initial estimation was
restricted by the signal sampling rate (in our case, 10 kHz). For
example, if the accuracy of the pitch estimation is within one sample
point, a pitch estimate of 100 sample points means an F0 of 100 ± 1 Hz,
and a pitch estimate of 50 sample points means an F0 of 200 ± 4 Hz.
Such estimation errors can cause a severe problem in identifying
high-order harmonics. For example, a 4-Hz error of F0 could lead to a
40-Hz shift in locating the 10th harmonic.
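The arithmetic behind these error bounds can be reproduced with a short Python sketch. The function name is hypothetical; the 10-kHz sampling rate and the one-sample tolerance are the values quoted above.

```python
def f0_range(period_samples, fs=10_000, tolerance=1):
    """F0 implied by a pitch period known to within +/- tolerance samples."""
    nominal = fs / period_samples
    lowest = fs / (period_samples + tolerance)
    highest = fs / (period_samples - tolerance)
    return nominal, lowest, highest


for period in (100, 50):
    nominal, lowest, highest = f0_range(period)
    max_error = max(nominal - lowest, highest - nominal)
    print(f"{period} samples: F0 = {nominal:.0f} Hz, "
          f"max error {max_error:.1f} Hz, "
          f"10th harmonic off by up to {10 * max_error:.0f} Hz")
```

The error grows roughly with the square of F0, since a one-sample change is a larger fraction of a shorter period, so high-pitched voices such as falsetto are affected most.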

An adaptive F0 correction procedure was used to reduce the error
in the initial F0 estimate. We defined F0n = F0 ± n·ΔF, where n = 1,
2, ..., 10, and ΔF is one tenth of the maximum possible error of F0,
as described above. Based on each F0n, the energy of the harmonic
components over the frequency range of 0 to 2 kHz was computed. The
value F0n that gave the maximum harmonic energy was selected as the
final estimate of F0. This criterion was used since our experimental
results showed that, over the low-frequency range of 0 to 2 kHz, the
harmonic energy is considerably higher than the interharmonic noise
energy.
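The search over candidate F0 values can be sketched as follows. This is a simplified illustration rather than the original implementation: the harmonic energy is taken as the squared magnitude of the Hamming-windowed signal's Fourier transform evaluated directly at each harmonic frequency, instead of being summed from FFT bins, and the initial estimate itself (n = 0) is included among the candidates. All function names are hypothetical.

```python
import math


def hamming(n_points):
    """Hamming window of n_points samples."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_points - 1))
            for n in range(n_points)]


def intensity_at(x, freq_hz, fs):
    """Squared magnitude of the Fourier transform of x at freq_hz."""
    w = 2.0 * math.pi * freq_hz / fs
    re = sum(v * math.cos(w * n) for n, v in enumerate(x))
    im = sum(v * math.sin(w * n) for n, v in enumerate(x))
    return re * re + im * im


def harmonic_energy(x, f0, fs, f_max=2000.0):
    """Sum of intensities at the harmonics of f0 up to f_max."""
    n_harm = int(f_max // f0)
    return sum(intensity_at(x, i * f0, fs) for i in range(1, n_harm + 1))


def refine_f0(x, f0_init, max_error, fs=10_000, steps=10):
    """Pick the candidate F0 +/- n*dF (n = 0..steps) with most harmonic energy."""
    delta = max_error / steps
    candidates = [f0_init + n * delta for n in range(-steps, steps + 1)]
    windowed = [v * w for v, w in zip(x, hamming(len(x)))]
    return max(candidates, key=lambda f: harmonic_energy(windowed, f, fs))
```

For a 2048-point synthetic signal with a true F0 of 201.6 Hz and an initial estimate of 200 Hz with a 4-Hz maximum error, the candidates are spaced 0.4 Hz apart and the sketch selects 201.6 Hz.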

Figure 3-16. Block diagram for computing noise-to-harmonic ratio (NHR).
The inputs are the speech samples s(n) and the EGG signal.

Once F0 was determined, the ith harmonic intensity Hi and the
interharmonic noise Ni were computed at the frequency region i·F0 ±
F0/2, where Hi represents the energy in the subregion centered at i·F0
with the bandwidth of the Hamming window used, and Ni represents the
energy in the rest of the frequency region (Figure 3-17). The
noise-to-harmonic ratio at the ith harmonic region (NHRi) was defined
as:

$$\mathrm{NHR}_i = \frac{N_i}{H_i} \qquad (3\text{-}12)$$

The distribution of the interharmonic noise was investigated by
plotting NHRi along the frequency axis (Figure 3-18). As can be seen,
the voices judged to be breathy have much higher interharmonic noise
above about 2 kHz. Based on this observation, we felt it would be
appropriate to use the noise-to-harmonic ratio over a high-frequency
range as an indication of the voice quality of breathiness. The NHRh
was thus defined as

$$\mathrm{NHR}_h = \frac{\sum N_i}{\sum H_i} \qquad (3\text{-}13)$$

where {Ni} and {Hi} are the noise and harmonic components above 2 kHz,
respectively. The overall NHR, which was defined over the whole
frequency range (0-5 kHz), was also computed for comparison. Table 3-5
lists the analysis results for samples of various voice types. As can
be seen, the NHRh (noise-to-harmonic ratio above 2 kHz) is a much
better predictor of breathy quality than the overall NHR.
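Equations (3-12) and (3-13) can be sketched as below for an intensity spectrum given on uniformly spaced bins. This is an illustrative sketch under stated assumptions, not the original code: the function names are hypothetical, the width of the harmonic subregion (mainlobe_hz) is left as a parameter because the text specifies it only as the bandwidth of the Hamming window, and the ratios are expressed in dB as 10 log10.

```python
import math


def harmonic_regions(intensity, bin_hz, f0, mainlobe_hz, f_max):
    """Split each region i*F0 +/- F0/2 into (i, N_i, H_i) as in Figure 3-17."""
    regions = []
    i = 1
    while (i + 0.5) * f0 <= f_max:
        first = math.ceil((i - 0.5) * f0 / bin_hz)
        last = min(math.ceil((i + 0.5) * f0 / bin_hz), len(intensity))
        harmonic = noise = 0.0
        for k in range(first, last):
            if abs(k * bin_hz - i * f0) <= mainlobe_hz / 2.0:
                harmonic += intensity[k]    # inside the window main lobe
            else:
                noise += intensity[k]       # rest of the region
        regions.append((i, noise, harmonic))
        i += 1
    return regions


def nhr_db(regions, f0, f_low=0.0):
    """10*log10(sum N_i / sum H_i) over harmonics centered at or above f_low."""
    kept = [(n, h) for i, n, h in regions if i * f0 >= f_low]
    return 10.0 * math.log10(sum(n for n, _ in kept) / sum(h for _, h in kept))
```

Calling nhr_db with f_low=2000.0 gives the high-frequency ratio NHRh, and calling it with the default f_low gives the overall NHR over all regions returned by harmonic_regions.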

Figure 3-17. Frequency intervals for harmonic and interharmonic noise
components.

Figure 3-18. The FFT spectra and noise-to-harmonic ratio (NHRi) for
(a) modal voice (DMH3), (b) falsetto (DMH5), (c) mimicked breathy
voice (DMH14), and (d) patient's breathy voice (EDR1). Axes: frequency
(Hz) versus intensity (dB).

Table 3-5. Noise-to-harmonic ratios.

Subject  Voice Type        Hi-Freq NHR (dB)  Overall NHR (dB)
DMH1     modal                  -5.1             -13.8
DMH2     modal                  -4.2             -16.5
GPM1     modal                  -4.2             -14.8
CKL1     modal                  -5.4             -14.8
CKL2     modal                  -5.6             -16.6
HBR1     modal                  -4.8             -16.9
HRB2     modal                  -6.2             -15.1
DRW1     modal                  -7.1             -14.9
DRW2     modal                  -5.1             -16.2
DMH5     falsetto               -8.5             -20.4
DMH6     falsetto               -8.2             -20.3
DMH7     falsetto               -9.9             -23.2
DMH8     falsetto               -8.4             -22.1
CKL4     falsetto               -4.3             -21.7
CKL5     falsetto               -3.9             -20.1
CKL6     falsetto               -6.1             -22.1
CKL7     falsetto               -3.3             -18.5
DMH14    breathy (mimic)        -0.7             -12.6
GPM2     breathy (mimic)         5.5             -12.7
EDR1     breathy (path.)        -0.4             -19.6
JMS1     breathy (path.)         0.2             -19.8
PJB1     breathy (path.)         5.9             -14.8
PJB2     breathy (path.)         6.2              -8.9

Temporal Energy Distribution

The temporal energy distribution of a speech waveform is related
to the phase characteristics of the glottal excitation and has long
been found to affect the perceptual quality. Wendahl et al. [1963]
first reported that the primary criterion that must be met for the
perception of a signal as vocal fry is that the signal be highly damped
between glottal excitations. Coleman [1963] further established that
vocal fry is perceived when the sound wave decays by approximately 43
dB between excitation pulses. (However, no detail was given about how
to calculate that figure.) A typical waveform of the vocal fry vowel
/i/ is shown in Figure 3-19-(a). Along with it in Figure 3-19-(b) is
the waveform of a modal voice of the same vowel phonated by the same
subject. The example clearly shows the rapid decay characteristic of
the vocal fry waveform. Another related example is reported by Sambur
et al. [1978]. Based on their experiments, they concluded that the
peaky waveform of an impulse excited LPC synthetic speech (Figure 3-20)
is the major reason for the voice quality of "buzziness".

To measure the decay characteristics of a speech waveform during a
single pitch period, a parameter called waveform peak factor (WPF) was
defined as

$$\mathrm{WPF} = \frac{\text{peak amplitude}}{\text{rms value}} = \frac{\max_i |x_i|}{\left( \frac{1}{N} \sum_{i=1}^{N} x_i^2 \right)^{1/2}} \qquad (3\text{-}14)$$

where xi is the amplitude of the ith sample point and N is the total
number of sample points in one pitch period. Theoretically, the
waveform peak factor, WPF, has a minimum value of 1 when the waveform
is flat, and a maximum value of √N when the waveform is a single
impulse. The WPF value of a speech waveform is related to the
underlying glottal waveshape. For glottal waves with narrow pulses
separated by long glottal closure, the WPF value is high, and vice
versa.

Figure 3-19. Speech waveforms of (a) vocal fry, (b) modal voice
(vowel /i/).

Figure 3-20. (a) Natural speech waveform (vowel /i/) and (b) the
corresponding impulse excited LPC synthetic speech waveform.

The average WPF values for sustained vowels of various voice types
are listed in Table 3-6. The results showed that the WPF value of a
speech sample is affected by the specific vowel of that speech sample.
As shown in Table 3-6, for the same subject and the same voice type,
the WPF value for vowel /a/ is usually slightly higher than that for
vowel /i/. Nevertheless, the general rule is that the speech samples
of vocal fry, modal, and falsetto registers are characterized by high,
medium, and low WPF values, respectively. Thus, our results implied
that the vocal fry register is characterized by a pulse-like excitation
wave with a long glottal closure, while the falsetto register, on the
contrary, is associated with a short glottal closure.
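The waveform peak factor of equation (3-14) is straightforward to compute for one pitch period of samples. The following Python sketch is illustrative only; the function name is hypothetical, and segmenting the speech signal into pitch periods is assumed to have been done beforehand.

```python
import math


def waveform_peak_factor(period):
    """Equation (3-14): peak amplitude divided by rms value over one period."""
    peak = max(abs(v) for v in period)
    rms = math.sqrt(sum(v * v for v in period) / len(period))
    return peak / rms
```

A constant-amplitude period gives the lower bound of 1, and a single impulse in a period of N samples gives the upper bound, the square root of N. An average over successive periods would correspond to the values reported for sustained vowels.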

Table 3-6. Waveform peak factors (WPF) for speech samples
of various voice types.

Subject  Voice Type        F0 (Hz)  Vowel  WPF
DMH1     modal             115      /i/    2.47
DMH2     modal             123      /a/    2.93
CKL1     modal             118      /i/    3.04
CKL2     modal             112      /a/    3.21
HRB1     modal             142      /i/    2.45
HRB2     modal             141      /a/    2.75
DRW1     modal             129      /i/    2.30
DRW2     modal             128      /a/    2.81
DMH9     slight vocal fry   90      /i/    3.79
DMH10    slight vocal fry   91      /a/    3.55
DMH13    severe vocal fry   45      /i/    5.29
CKL8     severe vocal fry   50      /i/    3.40
CKL9     severe vocal fry   48      /a/    3.97
DMH5     falsetto          306      /i/    1.58
DMH6     falsetto          312      /a/    2.12
CKL4     falsetto          263      /i/    1.49
CKL5     falsetto          238      /a/    2.17

EGG Waveform Features

The electroglottographic signal (EGG) is generally believed to
be representative of the amount of lateral contact between the
vocal folds. This measurement is attractive because it provides a
non-invasive method to observe the activity of vocal fold vibration.
However, because the vocal folds undergo complex three-dimensional
movements during vibration, the information carried by an
EGG signal (with only one amplitude dimension) is not straightforward
to interpret. Childers et al. [1983] and Krishnamurthy [1983]
studied the characteristics of EGG waveforms by observing the
synchronized ultra high-speed laryngeal films. They concluded that
the EGG and differential EGG (DEGG) waveforms are useful in registering
the glottal events during vocal fold vibration. When the EGG waveform
is arranged such that an upward deflection reflects the opening
of the glottis and a downward deflection depicts the closing of the
glottis, the sharp negative peaks in the DEGG waveform were found to
be very close to the instants of glottal closure (if they exist), and
the maxima of the DEGG are indications of the glottal opening (Figure
3-21). In this study, we investigated the EGG waveform features for
various types of phonation.

Figure  3-22  shows  the  typical  EGG  and  DEGG  waveforms  of  modal, 
vocal  fry,  falsetto,  and  breathy  phonations.  For  all  these  phonation 
types,  the  EGG  waveforms  exhibit  a steeper  change  of  slopes  (implying  a 
rapid  change  of  vocal  fold  contact  areas)  during  closing  phases.  This 
characteristic  phenomenon  of  vocal  fold  vibration  results  in  a sharp 
negative  pulse  in  the  DEGG  waveform  at  the  instant  of  maximum  closing 
slope  over  each  pitch  period.  This  result  is  consistent  with  that 
derived  from  the  inverse  filtered  glottal  waves  (Figure  3-9),  which 
also  shows  a steeper  change  during  closing  phases.  Aside  from  this 
common  feature,  distinctive  EGG  and  DEGG  waveform  patterns  (implying 
different  vocal  fold  contact  phenomena)  were  found  to  characterize 
different  types  of  phonation.  For  modal  and  vocal  fry  phonations 
(Figure  3-22-(a)  to  (d)),  during  each  pitch  period,  the  instant  of 
maximum  closing  slope  in  the  EGG  waveform  is  close  to  its  minimum 
extension,  and  thus  results  in  a very  narrow  negative  pulse  in  the  DEGG 
waveform.  This  implies  that  the  instant  of  DEGG  negative  peak  is  near 
the  occurrence  of  glottal  closure,  as  observed  by  Childers  et  al. 
[1983]  and  Krishnamurthy  [1983].  The  falsetto  DEGG  waveforms  (Figure 
3-22-(e)  and  (f)),  on  the  other  hand,  show  much  wider  negative  pulses, 
and thus do not provide good indications of glottal closure.  For the
breathy EGG waveforms (Figure 3-22-(g) and (h)), each closing branch is
closely followed by the next opening branch, implying that glottal
closure is momentary or absent.  One of the vocal fry EGG waveforms
(Figure 3-22-(d)) also clearly shows the double opening/closing pattern
during an individual voice cycle, as observed by other researchers
[Moore and von Leden, 1958; Timcke et al., 1959; Whitehead et al.,
1984].  In the following sections, we present parameters that were
defined to describe EGG waveform patterns.

Figure 3-21.  Glottal phase detection using EGG and DEGG waveforms.
Panels: (a) EGG, (b) DEGG, with the closed phase marked.

Figure 3-22.  The EGG (upper) and DEGG (lower) waveforms of various
voice types.  Recoverable panel labels: (a) modal voice (DMH1, 30 ms),
(b) modal voice (CKL1, 30 ms), (c) vocal fry (CKL11, 80 ms),
(d) vocal fry (DMH13, 80 ms), (e) falsetto (CKL4, 16 ms),
(g) breathy voice (JMS2, 25 ms).

Vocal  Fold  Contact  Characteristics 

As  a measure  of  vocal  fold  contact  area,  the  EGG  waveform  is 
considered  to  be  particularly  useful  in  registering  the  vocal  fold 
vibration  characteristics  during  the  glottal  closing  and  closed  phases. 
We  have  studied  vocal  fold  contact  phenomena  by  using  our  synchronized 
EGG  waveforms  and  ultra  high-speed  laryngeal  films.  For  a normal  male 
phonation  in  modal  or  vocal  fry  register,  the  closed  phase  typically 
begins  with  the  vocal  folds  making  contact  along  the  entire  midsagittal 
line  of  the  lower  edges  of  the  folds,  which  results  in  a rapid  decrease 
in  the  EGG.  The  vocal  fold  closure  continues  along  the  thickness 
dimension  toward  the  upper  margin  as  the  folds  roll  upward.  The 
elastic  collision  of  the  vocal  folds  causes  a rounding  curve  at  the 
EGG's  minimum  extension  (Figure  3-22-(a)  to  (d)).  On  the  other  hand, 
for a high-pitched falsetto or a breathy phonation, complete
glottal closure lasts only for a very short interval or is totally
absent; the closing phase is immediately followed by a vocal fold
opening phase.  The EGG waveform in this case is characterized by a
sharp angle at its minimum extension (Figure 3-22-(e) to (h)).

Figure 3-23.  Graphic illustration for the definition of EGG waveform
sharpness factors (SF1 to SF4).

We defined parameters on the EGG waveform to predict these
different glottal closure phenomena.  Four types of sharpness factors
(SF) were examined.  Referring to Figure 3-23, the lines L1 and L2 are
the tangents of the EGG waveform at the instants of maximum falling
slope and maximum rising slope, respectively.  The parameters were
defined as

$$ \mathrm{SF1} = \frac{d_2}{d_1} \tag{3-15} $$

$$ \mathrm{SF2} = \frac{d_3}{d_4} \tag{3-16} $$

$$ \mathrm{SF3} = 0.5\,(\mathrm{SF1} + \mathrm{SF2}) \tag{3-17} $$

$$ \mathrm{SF4} = \frac{d_2 + d_3}{d_1 + d_4} \tag{3-18} $$

where d1 to d4 are distances as indicated in Figure 3-23.  Among the
four parameters defined above, SF1 is a measure of the closing rate
before the maximum vocal fold contact, while SF2 is a measure of the
opening rate after the maximum vocal fold contact, and SF3 and SF4 are
two forms of the average of the closing and opening rates.
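
The four definitions can be sketched in a few lines of Python.  This is
only an illustration of equations (3-15) to (3-18): the function name
and the example distances are hypothetical, and the measurement of d1
to d4 from an actual EGG cycle is not shown.

```python
def sharpness_factors(d1, d2, d3, d4):
    """Return SF1..SF4 from the four distances of Figure 3-23."""
    sf1 = d2 / d1                   # closing side, eq. (3-15)
    sf2 = d3 / d4                   # opening side, eq. (3-16)
    sf3 = 0.5 * (sf1 + sf2)         # plain average, eq. (3-17)
    sf4 = (d2 + d3) / (d1 + d4)     # pooled ratio, eq. (3-18)
    return {"SF1": sf1, "SF2": sf2, "SF3": sf3, "SF4": sf4}


# Hypothetical distances in arbitrary (but common) units.
factors = sharpness_factors(d1=10.0, d2=2.0, d3=4.0, d4=12.0)
```

Because SF4 equals (d1 SF1 + d4 SF2) / (d1 + d4), it is a weighted
average of SF1 and SF2 and always lies between them, whereas SF3 weights
the two sides equally.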

The  results  of  these  four  parameters  for  the  EGG  waveforms  of 
various  voice  types  are  listed  in  Table  3-7  and  are  plotted  in  Figure 
3-24.  As  can  be  seen,  a general  rule  is  that  the  modal  and  vocal  fry 
EGG  waveforms  have  small  SF  values,  while  the  falsetto  and  breathy 
EGG  waveforms  are  characterized  by  high  SF  values.  There  are  a few 
exceptions  for  SF1  and  SF2,  but  for  the  average  sharpness  factors  SF3 
and  SF4,  all  the  samples  tested  comply  with  the  rule.  Figure  3-24  also 
reveals  that  SF4  is  the  best  parameter  to  distinguish  a falsetto  or 
breathy  phonation  from  the  modal  phonation. 

Table 3-7.  Sharpness factors for EGG waveforms.

Subject   Voice Type          SF1   SF2   SF3   SF4
DMH1      modal               .22   .31   .27   .26
DMH2      modal               .19   .34   .27   .24
GPM1      modal               .33   .28   .30   .29
CKL1      modal               .21   .37   .29   .27
CKL2      modal               .21   .50   .36   .30
HBR1      modal               .23   .30   .26   .27
HBR2      modal               .24   .41   .32   .31
DMH9      slight vocal fry    .18   .39   .28   .24
DMH10     slight vocal fry    .22   .35   .29   .27
DMH12     slight vocal fry    .25   .40   .33   .31
CKL10     severe vocal fry    .24   .36   .30   .30
CKL11     severe vocal fry    .23   .40   .32   .30
DMH5      falsetto            .34   .57   .45   .41
DMH6      falsetto            .45   .54   .49   .48
CKL4      falsetto            .39   .60   .50   .44
CKL5      falsetto            .39   .59   .49   .45
CKL6      falsetto            .40   .55   .47   .48
CKL7      falsetto            .60   .68   .64   .62
DMH14     breathy (mimic)     .41   .38   .40   .40
GPM2      breathy (mimic)     .35   .62   .48   .45
PJB1      breathy (path.)     .31   .48   .40   .38
PJB2      breathy (path.)     .31   .42   .37   .36
EDR1      breathy (path.)     .23   .48   .36   .34
EDR2      breathy (path.)     .32   .49   .40   .39
JMS1      breathy (path.)     .67   .57   .62   .59
JMS2      breathy (path.)     .44   .57   .51   .49

Figure 3-24.  Sharpness factors for EGG waveforms of various voice
types.

Glottal Phase Detection

Using the techniques proposed by Childers et al. [1983] and
Krishnamurthy [1983], we estimated the pitch period (PP) and the
glottal open quotient (OQ) for various voice types.  The pitch period
was estimated as the time between two successive negative
peaks in the DEGG waveform.  The open quotient was defined as

$$ \mathrm{OQ} = \frac{\text{duration of the open phase}}{\text{pitch period}} \tag{3-19} $$

where the open phase was estimated as the time between a
positive peak and the next adjacent negative peak in the DEGG waveform
(Figure 3-21).
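
The peak-picking procedure described above can be sketched as follows.
This is a minimal illustration, not the implementation used in this
study: the toy EGG signal, the half-of-minimum threshold, and the
`min_period` guard are all assumptions made for the example.  The EGG is
oriented so that an upward deflection corresponds to glottal opening.

```python
import math


def synthetic_egg(n_samples, period=100):
    """Toy EGG: low while the folds are in contact, high while open."""
    egg = []
    for n in range(n_samples):
        p = n % period
        if p <= 5:                      # fast closing branch
            egg.append(0.5 * (1.0 + math.cos(math.pi * p / 5)))
        elif p < 40:                    # closed phase
            egg.append(0.0)
        elif p <= 49:                   # slower opening branch
            egg.append(0.5 * (1.0 - math.cos(math.pi * (p - 40) / 9)))
        else:                           # open phase
            egg.append(1.0)
    return egg


def first_difference(x):
    """Discrete stand-in for the DEGG."""
    return [x[n + 1] - x[n] for n in range(len(x) - 1)]


def closure_instants(degg, min_period):
    """Indices of the sharp negative DEGG peaks, one per cycle."""
    threshold = 0.5 * min(degg)         # assumed detection threshold
    peaks = []
    for n in range(1, len(degg) - 1):
        is_local_min = degg[n] <= degg[n - 1] and degg[n] < degg[n + 1]
        if degg[n] < threshold and is_local_min:
            if peaks and n - peaks[-1] < min_period:
                if degg[n] < degg[peaks[-1]]:
                    peaks[-1] = n       # keep the deeper of two close peaks
            else:
                peaks.append(n)
    return peaks


def pitch_and_open_quotient(degg, min_period=20):
    """Per-cycle (pitch period in samples, OQ) following eq. (3-19)."""
    closures = closure_instants(degg, min_period)
    results = []
    for start, end in zip(closures, closures[1:]):
        # Opening instant: largest positive DEGG peak inside the cycle.
        opening = max(range(start + 1, end), key=lambda n: degg[n])
        period = end - start
        results.append((period, (end - opening) / period))
    return results


cycles = pitch_and_open_quotient(first_difference(synthetic_egg(450)))
```

On real recordings the negative peaks can be broad (as in the falsetto
examples), so a fixed threshold of this kind would likely need to be
replaced by something more robust.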

The average PP and OQ values estimated from the DEGG waveforms
of various voice types are listed in Table 3-8.  We noticed that the
OQ values based on the EGG waveforms are generally lower than those
estimated from the inverse filtered glottal waves.  The EGG-based
technique underestimated the OQ values especially for falsetto and
breathy voices with progressive glottal closure.  Nevertheless, the
results still showed that the voice types, ranked in order of
increasing OQ values, were vocal fry, modal, falsetto, and breathy
voices.

Table 3-8.  The average pitch periods (in number of sample points)
and open quotients estimated from EGG signals.

Subject   Voice Type          Task    PP      OQ
DMH1      modal               /i/     86.4    .58
DMH2      modal               /a/     80.6    .63
CKL1      modal               /i/     84.7    .44
CKL2      modal               /a/     88.2    .55
GPM1      modal               /i/     78.8    .56
HBR1      modal               /i/     70.2    .50
HBR2      modal               /a/     70.8    .57
DRW1      modal               /i/     77.3    .52
DRW2      modal               /a/     78.4    .50
DMH9      slight vocal fry    /i/    110.4    .60
DMH10     slight vocal fry    /a/    109.4    .56
CKL10     severe vocal fry    /i/    191.5    .19
CKL11     severe vocal fry    /a/    186.0    .25
DMH5      falsetto            /i/     32.7    .70
DMH6      falsetto            /a/     32.5    .72
CKL4      falsetto            /i/     38.3    .69
CKL5      falsetto            /a/     41.9    .68
DMH14     breathy (mimic)     /i/     93.2    .81
GPM2      breathy (mimic)     /i/     90.7    .60
JMS1      breathy (path.)     /i/     46.7    .82
JMS2      breathy (path.)     /i/     50.5    .80
EDR1      breathy (path.)     /i/     75.1    .42
EDR2      breathy (path.)     /i/     74.6    .51

One of the important advantages of using EGG, a non-invasive
method, to register the vocal fold vibration is that there is almost no
interference with normal speaking; hence, the dynamic characteristics
of the glottal movements can be studied by analyzing the EGG waveforms
of running speech.  Figure 3-25 shows an example: the EGG-based pitch
period and open quotient contours are displayed along with the speech
waveform of the sentence "Should we chase those cowboys?"  As can be
seen, at the final part of each word the open quotients increase,
reflecting the fact that the voice becomes soft due to less vocal
effort at word endings.  Another example, shown in Figure 3-26, is a
phonation of down-going chromatic scales for vowel /a/.  This example
shows that the open quotients decrease with increasing pitch periods,
except at the end of the phonation, where the open quotients increase
due to less vocal effort, as described above.

Figure 3-25.  Pitch period and open quotient contours estimated from the
EGG signal of a sentence.  Recoverable panel label: (a) speech waveform.

Figure 3-26.  Pitch period and open quotient contours estimated from the
EGG signal of a phonation of down-going chromatic scale on "la".
Panels: (a) speech waveform, (b) pitch period (msec), (c) open
quotient; time axis spans 2.8 seconds.

Summary.  Table 3-9 summarizes the analysis results derived in
this chapter, which show a great diversity in temporal and spectral
characteristics for glottal excitations of different voice types.
These results helped us to understand the nature and extent of the
variations of the human vocal excitation.  In the next chapter, we
report how the knowledge gained was used to improve the design of
voice excitations for producing natural-sounding synthetic speech with
desired voice characteristics.

Table 3-9.  Summary of the source-related features.

                                          VOICE TYPE
FEATURE                   MODAL VOICE   VOCAL FRY      FALSETTO        BREATHY VOICE

GLOTTAL WAVEFORM
  Pulse Width             medium        very short     long            long
  Pulse Skewing           medium        high           low             low
  Abruptness of Closure   abrupt        very abrupt    progressive     progressive
                          closure       closure        closure         closure

GLOTTAL SPECTRUM
  Spectral Slope          medium        flatter        steep-falling   steep-falling
  Harmonic Richness
  Factor                  medium        high           low             low

SPEECH FEATURES
  Spectral Tilt           medium        flatter        steep-falling   steep-falling
  Turbulent Noise         low           low            low             high
  Waveform Peak Factor    medium        high           low             low

EGG WAVEFORM FEATURES
  Open Quotient           medium        low            high            high
  Closure Sharpness
  Factor                  low           low            high            high


CHAPTER 4


VOICE SYNTHESIS AND SOURCE MODELING

The basic task of a speech synthesizer is to produce speech
utterances that can be understood by human users.  Thus, without
doubt, intelligibility (phonetic identifiability) of the synthetic
speech is of paramount importance.  However, subjective factors such
as quality and naturalness of the speech utterances also have an
effect on the usefulness and acceptability of a voice synthesizer.
Researchers [Rosenberg, 1971; Holmes, 1973; Childers et al., 1987;
Childers and Wu, 1988] have been interested in factors responsible for
quality and naturalness of synthetic speech.  It is now commonly
agreed that the perceptual quality and naturalness of synthetic
speech can be improved by using an appropriate source model during
voiced segments of speech.

In  this  chapter,  we  discuss  the  factors  that  affect  the  design  of 
source  models  for  speech  synthesis.  The  variations  of  glottal  source 
excitations  due  to  different  types  of  voice  production  were  studied. 
The  knowledge  gained  provided  an  objective  foundation  for  evaluating 
existing  glottal  waveform  models  and  was  used  in  designing  an  improved 
source  model.  Our  ultimate  practical  goal  is  to  develop  a source  model 
and  synthesis  rules  that  improve  the  naturalness  of  synthetic  speech 
and  enable  us  to  cope  with  different  voice  types  and  speaking  styles. 

Voice Synthesis Based on Source-Filter Theory

A critical step in the history of speech synthesis was the
development of the source-filter theory of speech production [Fant,
1960].  Widely used voice synthesizers, including the formant
synthesizers [Rabiner, 1968; Holmes, 1973, 1983; Klatt, 1980, 1987] and
the LPC synthesizers [Atal and Hanauer, 1971; Markel and Gray, 1976],
are based on this theory.  In its simplest form, this theory states
that  it  is  possible  to  view  speech  as  the  outcome  of  the  excitation  of 
a linear  filter  by  one  or  more  sound  sources.  The  primary  sources  of 
sound  are  voicing  caused  by  the  vibration  of  the  vocal  folds,  and  the 
turbulent  noise  caused  by  a pressure  difference  across  a constriction. 
The  linear  filter  simulates  the  resonance  effects  of  the  vocal  tract 
formed  by  the  pharynx  and  the  oral  cavity  (and,  sometimes,  the  nasal 
cavity).  In  this  research,  we  studied  the  factors  that  affect  the 
design  and  selection  of  the  source  excitation  for  natural-sounding 
voice  synthesis. 

Synthesis  Model  Factors 

As  described  in  the  previous  chapter,  three  factors  determine  the 
overall  spectral  envelope  of  a voiced  speech  signal:  (1)  the  transfer 
function  of  the  vocal  tract,  (2)  the  effect  of  radiation  at  the  lips 
and  nostrils,  and  (3)  the  spectrum  of  a single  glottal  pulse.  The  main 
problems  in  speech  synthesis  are  associated  with  factors  (1)  and  (3). 
The  effect  of  the  sound  radiation  from  the  lips  and  nostrils  can 
be  represented  over  most  of  the  speech  frequency  range  by  a simple 
differentiation filter [Fant, 1960], which can be implemented directly
or combined with some other function of the synthesizer.
For voice synthesizers based on the source-filter theory, it is,
in fact, quite acceptable to combine some aspects of factors (1),
(2) and (3) above in the spectrum-shaping filter system.  Because
synthesis models and filter configurations differ, the
corresponding source excitations must satisfy different requirements.

Cascade  formant  synthesis.  It  has  been  established  [Fant,  1960] 
that  if  the  vocal  tract  is  considered  as  a non-uniform  unbranched 
acoustic  tube,  excited  entirely  at  the  glottis  and  radiating  sound 
only  from  the  mouth  (implying  that  for  the  frequency  range  of  interest 
the  sound  transmission  in  the  tube  is  in  the  form  of  plane  waves), 
then  the  vocal  tract  transfer  function  can  be  approximated  by  a set 
of  formant  resonators  connected  in  cascade.  The  relative  amplitudes 
of  the  formant  peaks  in  this  case  (e.g.,  vowels)  are  correct  and 
require  no  individual  adjustments  for  each  formant.  Thus,  for  a 
cascade  formant  synthesizer,  only  the  vocal  tract  transfer  function  is 
simulated  by  the  spectrum-shaping  filter  system.  All  the  temporal  and 
spectral  characteristics  associated  with  the  glottal  volume  flow  need 
to  be  accounted  for  by  the  source  excitation.  Thus,  for  faithfully 
reproducing  the  original  voice  characteristics,  a cascade  formant 
synthesizer  requires  an  excitation  wave  that  resembles  the  natural 
glottal  volume  flow.  This  usually  is  not  an  easy  task.  But  due  to 
the  separate  simulation  of  the  vocal  tract  transfer  function  and  the 
glottal  source  excitation,  the  cascade  formant  synthesizers  are  useful 
for  perceptual  research,  whether  they  are  used  to  investigate  the 
tract-related  parameters  or  the  source-related  parameters. 

Parallel formant synthesis.  A parallel formant synthesizer
requires a gain control for each formant resonator to determine the
individual formant amplitudes [Rabiner, 1968; Holmes, 1973, 1983;
Klatt, 1980, 1987].  This requirement increases the complexity of a
parallel formant synthesizer, but it has the advantage that the
overall speech spectra can be reproduced by independent control of
the formant amplitudes.  In other words, a parallel-connected
resonator system can, by itself, simulate the vocal tract resonance
as well as the spectral trend of the glottal pulse and the lip
radiation.  Thus, the source excitation of a parallel synthesizer may
have a flat spectral trend.
Holmes  [1973]  recommended  an  excitation  signal  based  on  the  second 
time-derivative  of  a typical  glottal  volume-velocity  waveform.  For 
glottal  pulses  with  a -12  dB/octave  high-frequency  trend  the  twice- 
differentiated  signal  will  have  an  approximately  flat  spectrum.  But, 
as  demonstrated  in  the  previous  chapter,  the  spectral  characteristics 
of  the  human  glottal  excitation  vary  greatly  for  different  types  of 
voice  production.  To  synthesize  natural-sounding  speech  with  various 
voice  characteristics,  a more  sophisticated  spectral-flattening 
procedure  is  required. 
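
The flat-spectrum argument behind the twice-differentiated excitation
can be written out as a short worked relation.  It assumes an idealized
pulse whose magnitude spectrum falls as a pure power of frequency; the
exponent n is introduced here for illustration only, and the n = 3 case
is a hypothetical example rather than a measured value.

```latex
% Differentiating twice multiplies the magnitude spectrum by (2 \pi f)^2,
% i.e. adds 2 x 6 = 12 dB/octave to the spectral slope.
\left| \mathcal{F}\!\left\{ \frac{d^2 U_g}{dt^2} \right\}(f) \right|
    = (2 \pi f)^2 \, \left| U_g(f) \right|

% For a pulse with |U_g(f)| \propto f^{-n} (slope of about -6n dB/octave):
\text{slope after double differentiation} \approx (12 - 6n)\ \text{dB/octave}

% n = 2 (-12 dB/octave pulse):  12 - 12 =  0 dB/octave  (flat)
% n = 3 (-18 dB/octave pulse):  12 - 18 = -6 dB/octave  (residual tilt)
```

A fixed double differentiation therefore flattens only pulses close to
the -12 dB/octave case; steeper or flatter glottal spectra leave a
residual tilt in the excitation.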

LPC  Synthesis.  The  basic  assumption  for  the  LP  model-based  speech 
synthesis  is  that  the  speech  sounds  are  produced  as  a result  of 
acoustical  excitation  of  a linear  all-pole  (recursive)  filter  [Atal  and 
Hanauer,  1971].  This  is  equivalent  to  predicting  the  current  sample 
of  speech  by  a linear  combination  of  previous  samples,  hence  the  name 
linear  prediction.  Unlike  the  formant  synthesis,  the  method  of  LPC 
analysis-synthesis  does  not  explicitly  estimate  the  formant  parameters 
(frequency,  bandwidth,  and  amplitude)  but  rather  performs  an  envelope 
match  on  the  speech  spectrum.  The  combined  contributions  of  the 
glottal flow, the vocal tract, and the lip radiation are represented by
a single recursive filter.  The source excitation of an LPC synthesizer
is  primarily  used  to  account  for  the  effect  of  glottal  pulse 
periodicity.  And  the  basic  requirement  for  an  LPC  source  excitation  is 
to  have  an  essentially  flat  spectrum. 
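
The all-pole synthesis recursion can be sketched as below.  This is a
minimal illustration under stated assumptions: the predictor
coefficients come from a single hypothetical resonance (500 Hz centre
frequency, 80 Hz bandwidth, 8 kHz sampling rate) rather than from LPC
analysis of real speech, and the excitation is a plain impulse train.

```python
import math


def impulse_train(n_samples, pitch_period):
    """Unit impulses spaced one pitch period (in samples) apart."""
    return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]


def all_pole_synthesis(excitation, predictor, gain=1.0):
    """s[n] = gain * e[n] + sum_k predictor[k] * s[n - 1 - k]."""
    speech = []
    for n, e in enumerate(excitation):
        sample = gain * e
        for k, a_k in enumerate(predictor):
            if n - 1 - k >= 0:
                sample += a_k * speech[n - 1 - k]
        speech.append(sample)
    return speech


def single_resonance_predictor(freq_hz, bandwidth_hz, fs_hz):
    """Second-order predictor for one resonance (illustrative only)."""
    radius = math.exp(-math.pi * bandwidth_hz / fs_hz)
    a1 = 2.0 * radius * math.cos(2.0 * math.pi * freq_hz / fs_hz)
    a2 = -radius * radius
    return [a1, a2]


FS_HZ = 8000.0                                   # assumed sampling rate
predictor = single_resonance_predictor(500.0, 80.0, FS_HZ)
voiced = all_pole_synthesis(impulse_train(400, pitch_period=80), predictor)
```

Each output sample is the current excitation sample plus a linear
combination of previous outputs, which is the synthesis-side counterpart
of the prediction described above.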

An  accepted  practice  in  LPC  synthesis  is  to  use  a pitch-modulated 
impulse  train  as  the  excitation  for  voiced  speech  sounds.  However,  an 
impulse  train  excitation  completely  leaves  out  the  phase  characteristic 
of  the  glottal  source.  As  mentioned  in  the  previous  chapter,  it 
leads  to  a peaked  time  waveform  (see  Figure  3-16)  as  the  fundamental 
frequency  decreases,  and  thus  introduces  "buzziness"  in  the  synthetic 
speech.  Researchers [Rosenberg, 1971; Holmes, 1973, 1983; Sambur et
al., 1978; Wong and Markel, 1978; Naik, 1984; Childers and Wu, 1988]
have  proposed  several  forms  of  nonimpulse  excitations  that  retain 
certain  glottal  phase  characteristics.  They  reported  improvements  in 
the  perceptual  quality  and  naturalness  of  the  synthesized  speech. 

Schroeder  and  Atal  [1985]  proposed  a stochastic  approach  for 
determining  source  excitations  for  the  LPC  synthesizers.  Their 
technique  selects  an  optimum  innovation  sequence  from  a code  book  of 
stored  sequences  based  on  a given  fidelity  criterion.  Using  this 
technique,  high  quality  synthetic  speech  was  produced  at  a bit  rate  of 
4.8 kbits per sec.  This technique has the advantage of requiring
neither a voiced/unvoiced decision nor pitch estimation.  Its obvious
drawback is the extremely heavy computational cost: Schroeder and
Atal [1985] reported that their coding procedure took 125 sec of
Cray-1 CPU time to process 1 sec of speech signal.
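
The codebook-selection idea can be sketched as an exhaustive search.
This is a simplified illustration, not the procedure of Schroeder and
Atal: the fidelity criterion is plain squared error with a
least-squares gain, the codebook is random Gaussian, and the two-tap
predictor, codebook size, and sequence length are all hypothetical
choices.

```python
import random


def all_pole_filter(excitation, predictor):
    """y[n] = e[n] + sum_k predictor[k] * y[n - 1 - k], zero initial state."""
    out = []
    for n, e in enumerate(excitation):
        sample = e
        for k, a_k in enumerate(predictor):
            if n - 1 - k >= 0:
                sample += a_k * out[n - 1 - k]
        out.append(sample)
    return out


def search_codebook(target, codebook, predictor):
    """Return (index, gain, squared error) of the best stored sequence."""
    target_energy = sum(t * t for t in target)
    best = None
    for index, sequence in enumerate(codebook):
        synth = all_pole_filter(sequence, predictor)
        energy = sum(v * v for v in synth)
        if energy == 0.0:
            continue
        corr = sum(t * v for t, v in zip(target, synth))
        gain = corr / energy                      # least-squares gain
        error = target_energy - corr * corr / energy
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best


rng = random.Random(0)                            # fixed seed for repeatability
codebook = [[rng.gauss(0.0, 1.0) for _ in range(40)] for _ in range(64)]
predictor = [1.2, -0.6]                           # hypothetical stable filter
target = [2.5 * v for v in all_pole_filter(codebook[17], predictor)]
best_index, best_gain, best_error = search_codebook(target, codebook, predictor)
```

Every stored sequence must be passed through the synthesis filter for
every block of speech, which is why the computational cost grows so
quickly with codebook size.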

Classification of Source Excitations

Three approaches were used to derive source excitation waves for
voice synthesis, especially for formant synthesizers:

(1)  an impulse train with a fixed glottal shaping filter [Rabiner,
1968; Flanagan, 1957; Klatt, 1980, 1987],

(2)  estimates of the true glottal volume-velocity waveform by
inverse filtering a natural speech signal [Rosenberg, 1971;
Holmes, 1973] or by using a glottal area function [Yea, 1983],

(3)  stylized excitation waveforms, usually specified by parameters
associated with the events of vocal fold vibration [Rosenberg,
1971; Fant, 1979; Ananthapadmanabha, 1984; Hedelin, 1984;
Fant et al., 1985; Fujisaki and Ljungqvist, 1986; Klatt,
1987].

For  source  models  of  approach  (1),  the  fixed  glottal  shaping 
filters  were  designed  to  have  an  average  normal  glottal  spectrum.  For 
example,  Klatt  [1980]  used  an  impulse  train  filtered  by  a two-pole 
low-pass  filter,  which  has  a spectrum  that  falls  off  smoothly  at 
approximately  -12  dB/octave  above  50  Hz.  The  waveform  thus  generated 
does  not  have  the  same  phase  characteristics  as  a natural  glottal 
pulse,  nor  does  it  contain  spectral  dips  of  the  kind  that  often  appear 
in natural voicing.  Moreover, its fixed characteristics cannot
provide the flexibility needed to synthesize natural-sounding speech
with various voice qualities.
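
An impulse train shaped by a fixed two-pole low-pass filter can be
sketched with a generic second-order digital resonator.  The settings
below (centre frequency 0 Hz, bandwidth 100 Hz, 10 kHz sampling rate)
are assumptions chosen so that the response falls at roughly
-12 dB/octave above about 50 Hz; they are not taken from any particular
synthesizer.

```python
import math


def resonator_coefficients(freq_hz, bandwidth_hz, fs_hz):
    """Coefficients of y[n] = a*x[n] + b*y[n-1] + c*y[n-2], unity DC gain."""
    radius = math.exp(-math.pi * bandwidth_hz / fs_hz)
    c = -radius * radius
    b = 2.0 * radius * math.cos(2.0 * math.pi * freq_hz / fs_hz)
    a = 1.0 - b - c
    return a, b, c


def shaped_impulse_train(n_samples, pitch_period,
                         freq_hz=0.0, bandwidth_hz=100.0, fs_hz=10000.0):
    """Impulse train passed through the fixed two-pole shaping filter."""
    a, b, c = resonator_coefficients(freq_hz, bandwidth_hz, fs_hz)
    y1 = y2 = 0.0
    out = []
    for n in range(n_samples):
        x = 1.0 if n % pitch_period == 0 else 0.0
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def magnitude_db(freq_hz, a, b, c, fs_hz):
    """Gain of H(z) = a / (1 - b z^-1 - c z^-2) at freq_hz, in dB."""
    w = 2.0 * math.pi * freq_hz / fs_hz
    real = 1.0 - b * math.cos(w) - c * math.cos(2.0 * w)
    imag = b * math.sin(w) + c * math.sin(2.0 * w)
    return 20.0 * math.log10(a / math.hypot(real, imag))


source = shaped_impulse_train(1000, pitch_period=100)
```

Because the filter is fixed, every pulse has the same shape regardless
of voice type, which is the lack of flexibility noted above.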

As for approach (2), though an accurately estimated glottal wave
can faithfully reproduce the original voice characteristics [Holmes,
1973; Yea, 1983], the estimation of the human glottal volume-velocity
waveform is difficult in practice and has many restrictions (e.g., the
speech signal must be collected by a microphone with a high quality
low-frequency response, as described in Chapter 2).  In addition, this
form of voice source is not suitable for applications, such as a
text-to-speech system, in which a number of source parameters need to
be pre-stored.

On the other hand, a well designed stylized excitation waveform
model is easy to use and capable of producing natural-sounding
synthetic speech.  These advantages make it a good choice for both
vocoders and text-to-speech systems.  Over the years a
number of glottal waveform models have been proposed [Rosenberg, 1971;
Fant, 1979; Hedelin, 1984; Ananthapadmanabha, 1984; Fant et al., 1985;
Fujisaki and Ljungqvist, 1986; Klatt, 1987].  These models were
essentially designed to generate a family of voicing waveforms whose
waveshapes resemble the natural glottal volume flow.  In the
previous chapter, we studied the glottal factors that characterize
different types of voice production.  In the next section, we
discuss the significance of these factors for voice synthesis and
source modeling.  Before that, for the convenience of the discussion,
the formulas of two typical glottal waveform models, namely the Fant
[1979] model and the LF model [Fant et al., 1985], are given below.

Fant model.  The Fant [1979] model requires three parameters for a
glottal wave (Figure 4-1).  The model parameters are:

tp = glottal flow peak position.

td = instant of glottal closure.

A = peak value of glottal flow.

Figure 4-1.  The Fant [1979] model of glottal wave (Ug(t)).  The
corresponding differential glottal wave (Ug'(t)) is also presented.

The model waveshape is formulated by:

$$ U_g(t) = \frac{A}{2}\,[1 - \cos(\omega_g t)], \quad \text{for } 0 \le t \le t_p \tag{4-1} $$

and

$$ U_g(t) = A\,[K \cos(\omega_g (t - t_p)) - K + 1], \quad \text{for } t_p < t \le t_d \tag{4-2} $$

where $\omega_g = \pi / t_p$ is the frequency of the rising
portion of the pulse defined over the duration of the rising branch,
K specifies the steepness of the falling branch, and the flow is zero
from td to the end of the pitch period, tc.
The value of K can be derived by inserting Ug(td)=0 into equation
(4-2), i.e.,

$$ K = \frac{1}{1 - \cos[\omega_g (t_d - t_p)]} \tag{4-3} $$

For the extreme value $K = \infty$, the falling branch is a step function
from the peak to closure.  When K=0.5, the falling branch is symmetrical
with the rising branch.  Whatever the value of K, an abrupt closure
occurs at instant td.
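
A short sketch of the Fant waveshape may help in checking the role of
K.  It assumes the rising branch carries a factor of one half (so that
the flow reaches A at tp), that the flow is zero outside the interval
from 0 to td, and that td lies between tp and 2 tp; the function names
are illustrative only.

```python
import math


def fant_k(tp, td):
    """Steepness K obtained by setting Ug(td) = 0 in the falling branch."""
    wg = math.pi / tp
    return 1.0 / (1.0 - math.cos(wg * (td - tp)))


def fant_flow(t, tp, td, amp=1.0):
    """Glottal flow Ug(t) of the Fant model for one cycle starting at t = 0."""
    if not (0.0 < tp < td <= 2.0 * tp):
        raise ValueError("need 0 < tp < td <= 2 * tp")
    if t < 0.0 or t >= td:
        return 0.0                      # closed phase
    wg = math.pi / tp
    if t <= tp:                         # rising branch
        return 0.5 * amp * (1.0 - math.cos(wg * t))
    k = fant_k(tp, td)                  # falling branch
    return amp * (k * math.cos(wg * (t - tp)) - k + 1.0)


# Hypothetical timing in milliseconds: peak at 3 ms, closure at 5 ms.
pulse = [fant_flow(0.1 * n, tp=3.0, td=5.0) for n in range(80)]
```

Moving td toward tp drives K toward infinity (a step-like fall), while
td = 2 tp gives K = 0.5 and a falling branch that mirrors the rise.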

LF model.  The LF model [Fant et al., 1985] requires four model
parameters for a differential glottal wave (Figure 4-2).  These are:

tp = glottal flow peak position.

te = instant of maximum closing rate.

ta = time constant of an exponential recovery, i.e., return phase,
from the point of maximum closing discontinuity towards the
maximum closure.

A = glottal flow amplitude control.

Figure 4-2.  The LF model of differential glottal wave (Ug'(t)).
The corresponding glottal wave (Ug(t)) is also presented.

The equations governing the model are:

$$ U_g'(t) = A\,e^{\alpha t} \sin(\omega_g t), \quad \text{for } 0 \le t \le t_e \tag{4-4} $$

and

$$ U_g'(t) = \frac{U_g'(t_e)}{\epsilon\,t_a} \left[ e^{-\epsilon (t - t_e)} - e^{-\epsilon (t_c - t_e)} \right], \quad \text{for } t_e < t \le t_c \tag{4-5} $$

where $\omega_g = \pi / t_p$ is the pulse rise frequency and tc is the
pitch period.  Parameters $\alpha$ and $\epsilon$ are defined for
computational use.  The value of $\alpha$ can be derived by using the
four basic parameters listed above and by solving the equation

$$ \int_0^{t_c} U_g'(t)\,dt = 0 \tag{4-6} $$

Similarly, the value of $\epsilon$ can be derived by solving the equation

$$ \epsilon\,t_a = 1 - e^{-\epsilon (t_c - t_e)} \tag{4-7} $$
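
One possible way to obtain the two auxiliary constants numerically is
sketched below: a fixed-point iteration for equation (4-7) and a
bisection for equation (4-6).  This is an assumption-laden
illustration, not the procedure used in this study.  It takes the
return phase to be scaled by Ug'(te) so that the waveform is continuous
at te, assumes ta is well below tc - te, and uses an arbitrary search
bracket for alpha; the timing values are hypothetical and normalized to
tc = 1.

```python
import math


def solve_lf_constants(tp, te, ta, tc):
    """Return (alpha, epsilon) satisfying eqs. (4-6) and (4-7)."""
    if not (0.0 < tp < te < 2.0 * tp and te < tc and 0.0 < ta < tc - te):
        raise ValueError("need tp < te < 2*tp, te < tc and 0 < ta < tc - te")
    wg = math.pi / tp
    tail = tc - te

    # Eq. (4-7): fixed-point iteration started from the limit eps = 1/ta.
    eps = 1.0 / ta
    for _ in range(500):
        eps = (1.0 - math.exp(-eps * tail)) / ta

    def net_area(alpha):
        """Closed-form integral of Ug'(t) over one period for A = 1."""
        growth = math.exp(alpha * te)
        opening = (growth * (alpha * math.sin(wg * te) - wg * math.cos(wg * te))
                   + wg) / (alpha * alpha + wg * wg)
        at_te = growth * math.sin(wg * te)        # Ug'(te), negative
        decay = math.exp(-eps * tail)
        closing = at_te / (eps * ta) * ((1.0 - decay) / eps - tail * decay)
        return opening + closing

    # Eq. (4-6): bisection on alpha inside an assumed bracket.
    low, high = -50.0 / te, 50.0 / te
    if net_area(low) <= 0.0 or net_area(high) >= 0.0:
        raise ValueError("alpha is not bracketed for these parameters")
    for _ in range(200):
        mid = 0.5 * (low + high)
        if net_area(mid) > 0.0:
            low = mid
        else:
            high = mid
    return 0.5 * (low + high), eps


def lf_derivative(t, amp, tp, te, ta, tc, alpha, eps):
    """Differential glottal wave Ug'(t) of eqs. (4-4) and (4-5)."""
    wg = math.pi / tp
    if t <= te:
        return amp * math.exp(alpha * t) * math.sin(wg * t)
    at_te = amp * math.exp(alpha * te) * math.sin(wg * te)
    return at_te / (eps * ta) * (math.exp(-eps * (t - te))
                                 - math.exp(-eps * (tc - te)))


TP, TE, TA, TC = 0.4, 0.55, 0.02, 1.0             # hypothetical timing
alpha, eps = solve_lf_constants(TP, TE, TA, TC)
```

Equation (4-6) simply requires the flow to return to its starting value
at the end of each period, so the positive and negative areas of Ug'(t)
must cancel.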


The  LF  model  and  the  Fant  model  share  the  parameter  tp,  or 
c^=n/tp.  The  major  difference  between  these  two  models  is  that  the  LF 
model  allows  for  a residual  phase  of  progressive  closure,  while  the 
Fant  model  always  generates  an  abrupt  closure.  Besides,  the  Fant  model 
has  a discontinuity  at  the  flow  peak,  but  the  LF  model  is  continuous 
at  the  flow  peak.  To  relate  the  waveshape  parameters  of  these  two 
models,  we  defined  the  open  quotient  (0Q)  and  speed  quotient  (SQ)  based 
on  these  two  models: 


0Q 


open  phase 


pitch  period 


— OQp  = , for  Fant  model 

*c 

tg  + k ta 

— OQlf  = , for  LF  model 


(4-8) 


(4-9) 


and 
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t 


P 


r-  SQp  = 


, for  Fant  model  ( 4— 10 ) 


opening  phase 


SQ  = 


closing  phase 


t 


P 


>—  sqlf  = 


, for  LF  model  (4-11) 


te  + k.  ta  tp 


For the LF model, the instant of "glottal closure" was defined as the
instant at which the glottal flow amplitude drops to 1% of its peak
value. Based on this definition, the value of k is a function of the
parameter ta. Our data show that k has values in the range of 2.0 to
3.0 when 0% < ta < 10% (k = 0 when ta = 0), where ta is expressed as a
percentage of the pitch period, tc.
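As a quick worked example of the LF-based quotients, the sketch below evaluates the open quotient and speed quotient for one set of waveshape values (tp = 50%, te = 65%, ta = 2%, as used for synthetic sample 2-2 in Table 5-1). The value k = 2.0 is an assumption chosen from the reported 2.0 to 3.0 range, and the function names are illustrative.

```python
def open_quotient_lf(te, ta, k, tc=1.0):
    """OQ_LF = (te + k*ta) / tc, with te, ta, and tc in the same time unit."""
    return (te + k * ta) / tc

def speed_quotient_lf(tp, te, ta, k):
    """SQ_LF = tp / (te + k*ta - tp): opening phase over closing phase."""
    return tp / (te + k * ta - tp)

# Times are fractions of the pitch period (tc = 1); k = 2.0 is assumed.
oq = open_quotient_lf(te=0.65, ta=0.02, k=2.0)
sq = speed_quotient_lf(tp=0.50, te=0.65, ta=0.02, k=2.0)
print(round(oq, 2), round(sq, 1))
```

With these inputs the results agree, after rounding, with the values tabulated for that sample (OQ_LF = .69, SQ_LF = 2.6).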


Glottal Factors for Source Modeling

Based on the analysis results presented in the previous chapter,
four major factors were found to be important in characterizing
different types of voice production: namely, the glottal pulse width,
the glottal pulse skewness, the abruptness of glottal closure, and the
turbulent noise component. Here, we discuss the nature and extent of
the variations of these factors and their significance for the
production of voice quality.

Glottal Waveform Factors

We started our study by observing the relationships between the
differential glottal waves (Ug'(t)) and their corresponding speech
signals. A differential glottal pulse, i.e., the combined effect of
the lip radiation and the glottal volume velocity, represents the
effective excitation to the vocal tract (see Chapter 3 and equation
(3-4)). A typical example is shown in Figure 4-3, where we can see
that the starting point of the formant oscillation within a pitch cycle
occurs at the instant where the effective excitation (Ug'(t)) has its
maximum closing discontinuity. The abrupt change of the glottal volume
flow causes the most significant excitation to the vocal tract. After
the main excitation and during the closed phase, the speech waveform
exhibits an exponential decay due to the vocal tract damping. The
exponential decay lasts until the glottal opening, where the effect of
some secondary excitations alters its trend.

Figure 4-3. Synchronized glottal wave Ug(t), differential glottal wave
Ug'(t) (i.e., the effective excitation wave to the vocal tract), and
the resultant speech waveform.

As pointed out in the previous chapter, for all the voice types
under investigation (and very likely for any other voice type), the
most prominent excitation pulse of Ug'(t) occurs during the glottal
closing phase. One distinctive feature that characterizes different
types of voice production is the sharpness of the main excitation
pulse. In terms of the glottal volume-velocity waveform (Ug(t)), this
corresponds to the steepness of the closing phase and the abruptness
of glottal closure (see Figure 3-8). The significance of the glottal
closing phase can be illustrated in the frequency domain. Figure
4-4 (a) shows several LF model-generated glottal waves, each with a
different closing steepness, and Figure 4-4 (b) shows the corresponding
spectra. As can be seen, the most evident change in the frequency
domain due to the variations of closing steepness is in the spectral
slope. The abruptness of glottal closure is also related to the
spectral slope. Fant et al. [1985] pointed out that the effect of
the parameter ta (a time factor to control the abruptness of glottal
closure) in their LF model is an additional first-order low-pass
function with a cut-off frequency of $1/(2\pi t_a)$. Figure 4-5 shows some
examples. The spectral slope of a glottal excitation has a great
effect on the perceived voice quality and contains speaker-specific
information, though it is of little significance for speech
intelligibility (phonetic identifiability).

Figure 4-4. (a) Glottal waves with various closing steepness; (b) the
corresponding spectra (a preemphasis filter $1 - z^{-1}$ was used to reduce
the dynamic range).

Figure 4-5. (a) Glottal waves with various abruptness of glottal
closure; (b) the corresponding spectra (a preemphasis filter $1 - z^{-1}$
was used to reduce the dynamic range).
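To give a feel for the magnitude of the low-pass effect attributed to ta, the following sketch converts a ta value given as a fraction of the pitch period into the corresponding first-order cut-off frequency 1/(2π ta). The function name is illustrative; the 0.1 msec sampling period is the one quoted with Table 4-1, and the two input pairs are the ta and tc values listed there for samples JMS2 (breathy) and DMH3 (modal).

```python
import math

def ta_cutoff_hz(ta_fraction, tc_samples, sample_period_s=1e-4):
    """Cut-off frequency 1/(2*pi*ta), with ta given as a fraction of tc."""
    ta_seconds = ta_fraction * tc_samples * sample_period_s
    return 1.0 / (2.0 * math.pi * ta_seconds)

fc_breathy = ta_cutoff_hz(0.100, 50)  # ta = 10.0% of a 50-sample period
fc_modal = ta_cutoff_hz(0.021, 94)    # ta = 2.1% of a 94-sample period
print(round(fc_breathy), round(fc_modal))
```

A larger ta gives a lower cut-off frequency, that is, a steeper spectral roll-off, which is the direction of the effect described above.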

The glottal closing phase comes to an end when the vocal folds
make contact and possibly close the glottis. During the closed glottal
interval (the flat portion in Ug(t)), the excitation no longer exists
and the speech pressure wave undergoes an exponential decay due to the
vocal tract damping. The duration of the glottal closure is strongly
related to the specific voice type, as shown in the previous chapter.
For a falsetto or breathy voice, the glottal closure is short or
absent. On the other hand, vocal fry has a long closed phase between
glottal excitations and is perceived as a repetitive popping sound.

During the glottal opening phase, the effective excitation,
Ug'(t), is characterized by a much smoother pulse (see Figure 4-3 and
Figure 3-8), which introduces secondary excitations to the vocal tract.
After the interval of glottal closure, the secondary excitations
create new formant oscillations in the speech wave (see Figure 4-3).
Though not as significant as the main excitation pulses, secondary
excitations also contribute to the perceptual naturalness and voice
quality. When they are completely absent, e.g., in impulse-excited
LPC synthetic speech, the voice sounds buzzy [Sambur et al., 1978].

To  show  the  extent  of  glottal  waveshape  variations,  Figure  4-6 
presents  the  inverse  filtered  differential  glottal  waves  of  various 
voice  types  and  the  matching  waveforms  produced  by  the  LF  model.  An  LF 
model  matching  waveform  was  derived  by  measuring  the  initial  estimates 
of  the  amplitude  parameter  Ug'(te)  and  the  waveshape  parameters  tp,  te, 
and  ta  in  the  corresponding  pitch  period  of  the  differential  glottal 
wave,  and  then  adjusting  them  based  on  a least  squared  error  criterion 
that  minimizes  the  error  between  the  signal  and  the  matching  waveform. 
The procedure of deriving an LF model matching waveform is presented in
Appendix B. The measured waveshape parameter values (expressed as
percentages of the pitch period) and the corresponding factors for the
glottal pulse width (OQ_LF) and the glottal pulse skewness (SQ_LF) are
listed in Table 4-1. The data showed that the glottal pulse width, the
glottal skewness, and the abruptness of the glottal closure (controlled
by waveshape parameter ta) are important factors in characterizing
the glottal excitations for different voice types. The results also
showed that, for producing various voice characteristics, these glottal
factors vary over wide ranges: OQ_LF = .25 to 1.0, SQ_LF = 1.3 to 3.6, and
ta = 0.5% to 13.3%.

Figure 4-6. The inverse filtered differential glottal waves and the
matching waveforms produced by the LF model. Panels include (a) modal
voice (DMH3), (d) vocal fry (CKL11), (e) falsetto (CKL7), (f) falsetto
(DMH8), (g) breathy voice (EDR2), and (h) breathy voice (JMS2).

Turbulent  Noise  Component 

For  a voiced  phonation,  normally,  the  glottal  volume  flow  is  in 
the  form  of  a quasi-periodic  pulse  train.  But,  when  the  glottis 
has  an  imperfect  closure  and  the  airflow  rate  is  high,  a turbulent 
airflow  is  produced.  The  critical  condition  where  the  airflow  changes 
from  laminar  to  turbulent  is  determined  by  Reynolds'  number  (Re)  as 
expressed  by  the  formula  [Flanagan  and  Ishizaka,  1976] 

$$
Re = \frac{\rho\, v\, h}{\mu} = \frac{v\, h}{\nu}
\tag{4-12}
$$

where h = effective width of the stricture; v = velocity of airflow;
$\rho$ = density of air; $\mu$ = coefficient of viscosity; and $\nu = \mu/\rho$ is the
kinematic viscosity. The effective width, h, is defined as 4A/S, where A = cross
section area and S = circumference of the cross section of the tube.

When the Reynolds' number exceeds a certain value, the
critical Reynolds' number (Rec), laminar flow becomes turbulent.

Table 4-1. Waveshape parameters (based on the LF model) for glottal
waves of various voice types.

Sample   Voice Type         tc     tp     te     ta      OQ_LF   SQ_LF
DMH3     modal              94     49%    64%    2.1%    .67     2.7
DMH16    modal              79     53%    71%    2.5%    .75     2.4
CKL3     modal              65     51%    68%    1.5%    .69     2.8
DMH12    slight vocal fry   119    49%    63%    0.8%    .64     3.3
CKL11    vocal fry          220    20%    25%    0.5%    .26     3.6
DMH8     falsetto           29     57%    77%    13.3%   1.00    1.3
CKL7     falsetto           47     62%    89%    4.3%    .98     1.7
EDR2     breathy            73     48%    68%    6.8%    .81     1.5
JMS2     breathy            50     58%    84%    10.0%   1.00    1.4

tc (pitch period) in number of sample points (sampling period = 0.1 msec);
tp, te, and ta in percentage of the pitch period.
Isshiki et al. [1978] studied the glottal turbulent noise by using a
life-size laryngeal model. They reported that the critical Reynolds'
number for their laryngeal model was approximately 2,000. Their
experiments also showed that the sound pressure of the noise is nearly
proportional to the square of the Reynolds' number. Using equation
(4-12), for a circular constriction with radius r,

$$
Re^2 = \frac{4^2\, U^2}{\nu^2\, (2\pi r)^2} = \frac{4\, U^2}{\nu^2\, \pi A}
\tag{4-13}
$$

and for a rectangular slit with a long side of d,

$$
Re^2 = \frac{4^2\, U^2}{\nu^2\, (2d)^2} = \frac{4\, U^2}{\nu^2\, d^2}
\tag{4-14}
$$

where U = volume velocity. Thus, during vocal fold vibration the
sound pressure of the glottal turbulent noise fluctuates due to the
variations in airflow and glottal area. More precisely, for a circular
constriction the sound pressure of the noise is approximately
proportional to the square of the volume velocity of the airflow and
inversely proportional to the cross sectional area of the stricture.
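The step from equation (4-12) to (4-13) and (4-14) uses v = U/A and h = 4A/S, so that Re = 4U/(νS). The sketch below checks both closed forms numerically. The numerical values (a kinematic viscosity of about 0.15 cm²/s for air, a flow of 200 cm³/s, and the constriction dimensions) are illustrative assumptions, not measurements from this study, and the slit case assumes a width much smaller than d so that S ≈ 2d.

```python
import math

def reynolds_number(U, A, S, nu):
    """Re = v*h/nu with v = U/A (flow velocity) and h = 4*A/S (effective width)."""
    v = U / A
    h = 4.0 * A / S
    return v * h / nu

nu = 0.15   # cm^2/s, approximate kinematic viscosity of air (assumed)
U = 200.0   # cm^3/s, illustrative volume velocity

# Circular constriction of radius r: A = pi*r^2, S = 2*pi*r.
r = 0.1
A_circ = math.pi * r ** 2
re_circ = reynolds_number(U, A_circ, 2.0 * math.pi * r, nu)

# Narrow rectangular slit with long side d and width w << d: S is about 2*d.
d, w = 1.0, 0.05
re_slit = reynolds_number(U, d * w, 2.0 * d, nu)

print(re_circ > 2000.0, re_slit > 2000.0)
```

Comparing the result with the critical value of about 2,000 reported by Isshiki et al. [1978] indicates whether the assumed flow would be turbulent.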

Isshiki et al. [1978] also studied the spectral characteristics
of the turbulent noise. Their data showed that the energy of the
turbulent noise is distributed over a wide range of frequencies
(2-8 kHz), with some accentuation in the 4 kHz region. We performed
spectral analyses on our breathy voice samples and obtained a
consistent result, showing a high interharmonic noise level above 2 kHz
(see Chapter 3).

Though the existence of high-frequency turbulent noise has been
shown to be an important feature for breathiness [Yanagihara, 1967;
Yumoto and Gould, 1982; Hiraoka et al., 1984; Hurme and Sonninen,
1985], the role of a turbulent noise source in synthetic speech has not
yet been fully clarified. Most of the existing glottal source models
neglect the turbulent noise component. In this study, we incorporated
a turbulent noise generator in our new source model, which allows for
variable temporal and spectral properties for the turbulent noise. The
perceptual effects of various noise characteristics will be reported in
the next chapter.

Evaluation  of  Source  Models 

To synthesize natural-sounding speech with desired voice
characteristics, a voice source model must provide control over the
parameters that are important for perception. Four source quality
factors (i.e., the glottal pulse width, the glottal pulse skewness, the
abruptness of glottal closure, and the turbulent noise component) were
discussed above. Here, we evaluate several typical glottal waveform
models and voice source models [Rosenberg, 1971; Fant, 1979; Hedelin,
1984; Ananthapadmanabha, 1984; Fant et al., 1985; Fujisaki and
Ljungqvist, 1986; Klatt, 1987] based on their capability to control
these factors and the number of model parameters they require. We
summarize the evaluation results in Table 4-2.

All the source models under investigation allow variable glottal
pulse width and skewness. Only Klatt [1987] incorporated a turbulent
noise component into the source model for his Klattalk formant
synthesizer, but no detailed information about this turbulent noise
generator was given. Three glottal waveform models [Fant et al., 1985;
Ananthapadmanabha, 1984; Fujisaki and Ljungqvist, 1986] can vary the
abruptness of the glottal closure, while the others [Rosenberg, 1971;
Fant, 1979; Hedelin, 1984; Klatt, 1987] always generate an abrupt
closure after the instant of maximum closing slope (e.g., see the Fant
model shown in Figure 4-1). None of the source models provides control
over all four of the factors we listed.

Table 4-2. Controllable factors for existing glottal waveform models.

Glottal Waveform Models   ROS   FANT   HED   ANA   LF    FL    KLA
Pulse Width               yes   yes    yes   yes   yes   yes   yes
Pulse Skewness            yes   yes    yes   yes   yes   yes   yes
Closure Abruptness        no    no     no    yes   yes   yes   no
Turbulent Noise           no    no     no    no    no    no    yes
No. of Parameters         3     3      3     5     4     7     5

ROS  = Rosenberg [1971]
FANT = Fant [1979]
HED  = Hedelin [1984]
ANA  = Ananthapadmanabha [1984]
LF   = Fant et al. [1985]
FL   = Fujisaki and Ljungqvist [1986]
KLA  = Klatt [1987]

The importance of allowing a progressive closure in a glottal
waveform model can be illustrated by the following example. Figure 4-7
shows the inverse filtered differential glottal wave derived from
the vowel data JMS2 (with a breathy voice quality) and its matching
waveforms based on the LF model [Fant et al., 1985] and the Fant model
[Fant, 1979]. As can be seen, the LF model approximates the natural
glottal wave much better because it can adjust
the abruptness of the glottal closure. The Fant model (or any other
glottal waveform model with an abrupt glottal closure) is inadequate
for simulating the glottal wave of a breathy or falsetto phonation
or during a vowel termination, where the vocal fold vibration is
characterized by a progressive closure.

Aside from the turbulent noise component, the LF model has the
capability of varying the three glottal waveshape factors with a minimum
number of model parameters. Figure 4-6 in the previous section shows
the LF model-based matching waveforms for the differential glottal
waves of various voice types, and illustrates that the LF model can
approximate a wide range of natural glottal excitations.

An  Experimental  Source  Model 

A new source model (Figure 4-8-(a)) was proposed in this study,
which consisted of two components: (1) a glottal pulse generator, which
produces a pitch-modulated pulse train, and (2) a turbulent noise
generator. This source model includes the glottal factors which we
considered to be important for synthesizing natural-sounding speech
with various voice characteristics. Using this source model, the
perceptual significance of the glottal factors can be evaluated by
listening to the synthetic voice samples.

Figure 4-7. An inverse filtered differential glottal wave (JMS2) and
its matching waveforms based on (a) the LF model (waveshape parameters
given in Table 4-1), and (b) the Fant model.

Figure 4-8. Block diagram of the proposed source model and an example
of the generated excitation wave: (a) the proposed source model; (b)
example of excitation wave, showing the pulse component (upper trace)
and the noise component (lower trace).

Glottal  pulse  generator.  We  adapted  the  LF  model  [Fant  et  al., 
1985]  for  the  glottal  pulse  generator.  As  demonstrated  above,  the  LF 
model  can  approximate  a wide  range  of  natural  glottal  waves  by  varying 
the  pulse  width,  the  pulse  skewness,  and  the  abruptness  of  glottal 
closure.  We  also  showed  that  appropriate  parameter  values  for  the  LF 
model  can  be  derived  by  using  the  inverse  filtered  differential  glottal 
waves.  The  measured  waveshape  parameter  values  for  various  voice  types 
(Table  4-1)  in  this  research  provide  a good  reference  for  using  the  LF 
model  to  synthesize  speech  with  a specific  voice  quality. 

Turbulent noise generator. The turbulent noise generator
consisted of a random number generator, a spectrum-shaping filter, and
an amplitude modulator. The random number generator produces random
noise with a normal distribution and a flat spectrum. The amplitude
level of the random noise was controlled by a parameter An, which
specifies the energy ratio between the noise and the pulse components.
The spectrum-shaping filter was designed to simulate the spectral
characteristics of the glottal turbulent noise. In the present
research, based on our data (see Chapter 3) and that derived by Isshiki
et al. [1978], a high-pass filter with a cut-off frequency of 2 kHz
and a transition band of 500 Hz was used. The amplitude modulator
was intended to simulate the amplitude fluctuations of the glottal
turbulent noise due to the variations in airflow and glottal area
during vocal fold vibration. A pitch-modulated square wave with an
adjustable duty cycle was used. Two parameters were used to control
the starting position (Tn) and the duration (Dn) of the duty cycle.
Figure 4-8-(b) shows an example of an excitation signal with the pulse
and noise components.
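The three stages of the noise generator can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in this study: the one-pole high-pass filter is a simple stand-in for the 2 kHz filter with a 500 Hz transition band (whose design is not given here), Tn and Dn are taken as fractions of the pitch period, the 10 kHz sampling rate follows the 0.1 msec sampling period quoted with Table 4-1, and all function names are hypothetical.

```python
import math
import random

def highpass_one_pole(x, cutoff_hz, fs_hz):
    """First-order high-pass filter (assumed stand-in for the shaping filter)."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    a = rc / (rc + 1.0 / fs_hz)
    y, prev_x, prev_y = [], 0.0, 0.0
    for sample in x:
        prev_y = a * (prev_y + sample - prev_x)
        prev_x = sample
        y.append(prev_y)
    return y

def turbulent_noise(pulse, period, An, Tn, Dn,
                    fs_hz=10000.0, cutoff_hz=2000.0, seed=0):
    """Noise component for a pulse train with a fixed pitch period (in samples).

    An: noise-to-pulse energy ratio; Tn, Dn: start and duration of the
    noise burst within each pitch period, as fractions of the period.
    """
    rng = random.Random(seed)
    # Stage 1: Gaussian noise with a flat spectrum.
    noise = [rng.gauss(0.0, 1.0) for _ in pulse]
    # Stage 2: spectral shaping.
    noise = highpass_one_pole(noise, cutoff_hz, fs_hz)
    # Stage 3: pitch-synchronous square-wave gating.
    start = int(round(Tn * period))
    length = int(round(Dn * period))
    gated = [v if (n % period - start) % period < length else 0.0
             for n, v in enumerate(noise)]
    # Scale so that noise energy / pulse energy equals An.
    pulse_energy = sum(v * v for v in pulse)
    noise_energy = sum(v * v for v in gated)
    gain = math.sqrt(An * pulse_energy / noise_energy) if noise_energy > 0.0 else 0.0
    return [gain * v for v in gated]

# Placeholder pulse train: a sinusoid with a 100-sample period.
period = 100
pulse = [math.sin(2.0 * math.pi * n / period) for n in range(5 * period)]
noise = turbulent_noise(pulse, period, An=0.1, Tn=0.5, Dn=0.5)
excitation = [p + q for p, q in zip(pulse, noise)]
```

Setting Dn = 1.0 gives continuous noise, and raising the cut-off toward zero removes the spectral shaping, so the same routine covers the gated and ungated, filtered and unfiltered variants.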

Besides  simulating  breathiness,  the  turbulent  noise  generator 
described  above  is  also  suitable  for  producing  excitation  signals 
for  voiced  fricatives,  where  the  turbulent  noise  is  generated  at  a 
constriction  in  the  vocal  tract.  Our  observations  suggest  that  the 
amplitude  modulation  factor  is  especially  important  for  voiced 
fricatives.  For  example,  as  shown  in  Figure  4-9,  the  waveform  of 
a sample  of  the  voiced  fricative  /z/  clearly  shows  the  amplitude 
modulation  of  the  noise  component. 

In  the  next  chapter,  we  describe  the  experiments  of  perceptual 
evaluation  for  selected  glottal  factors.  The  proposed  source  model  was 
used  with  a cascade  formant  synthesizer  to  produce  speech  stimuli.  As 
will  be  shown,  this  source  model  can  generate  a wide  range  of 
excitation  waves  to  produce  various  voice  characteristics. 
Figure 4-9. Speech waveform of a sample of the voiced fricative /z/.


CHAPTER  5 


PERCEPTUAL  EVALUATION  FOR  SOURCE  FACTORS 

This chapter describes the procedures and the results of the
perceptual evaluation experiments for selected source quality
factors. The purpose of these experiments was to understand the
perceptual correlates of the source factors. The knowledge gained
should be useful for synthesizing natural-sounding speech and for
establishing rules for synthesizing speech with specific voice
characteristics. The stimuli used for the perceptual evaluation were
synthetic voice samples produced by the proposed source model with a
cascade formant synthesizer. This arrangement ensured that the source
factors under investigation were varied in a controlled manner. The
auditory responses to the source factors were judged by listening
tests.


Listening  Test 

Stimuli. The stimuli for our listening tests were synthetic
voices produced by the proposed source model with a cascade formant
synthesizer [Klatt, 1980]. For each source factor under investigation,
a group of synthetic vowels was produced by progressively changing
an appropriate source parameter while keeping the formant
structure fixed. All the stimuli used were sustained vowels of about two
seconds and their intensities were adjusted to the same level.
(However, the perceived loudnesses were not necessarily the same since
different voice samples may have different energy distributions in
frequency.)

Listening  Group.  The  judges  for  the  listening  test  were  three 
professors,  two  from  the  Speech  Department  and  one  from  the  Electrical 
Engineering  (EE)  Department.  The  two  speech  professors  were  also 
professional  voice  diagnosticians  and  the  EE  professor  was  an 
experienced  speech  scientist. 

Perceptual Rating. Three terms describing different voice
qualities were used in the perceptual rating of this study, namely,
naturalness, breathiness, and hypo-/hyperfunction. Naturalness is
a highly subjective attribute, which in this study is used in the
sense of "human sounding." Breathiness is often associated with an
incomplete glottal closure during vocal fold vibration, suggesting that
the important audible component of the sound is the noise
produced by the escape of turbulent airflow at the glottis. The
voice quality of hypo- or hyperfunction is related to the perceptual
sensation of vocal effort. Hyperfunction is used to describe a
strained or tense voice quality, as if the vocal folds are compressed
during phonation. Hypofunction, on the contrary, is due to too little
tension in the vocal folds, resulting in an asthenic or lax voice
quality. These voice qualities were rated on a 7-point scale, from
0 to 6. For the naturalness and breathiness scales, a rating of 6
represents a high degree of the quality and 0 represents an absence
of the quality. For the hypo-/hyperfunction scale, a rating of 6
represents extreme hyperfunction, 0 represents extreme
hypofunction, and 3 represents a normal voice quality. Figure 5-1
illustrates these rating scales graphically.
Figure 5-1. Graphical illustration for the perceptual ratings of three
voice qualities (naturalness, breathiness, and hypo-/hyperfunction).

Test Procedure. The listening tests were conducted in a
professional sound room. The stimuli were presented via headphones.
Each judge was asked to give his perceptual ratings for one group of
stimuli at a time. There were nine groups of stimuli, each consisting
of either four or five vowel samples. The judges were not told the
manner by which the voice samples were synthesized. They were given
the opportunity to listen to the vowel samples as many times as they
desired.


Perceptual  Evaluation 


Glottal  Waveshape  Factors 

Three groups of vowel samples were synthesized for perceptual
evaluation. Each group was produced by progressively varying one of
the three waveshape parameters of the LF glottal waveform model: tp
(the opening time), te (the instant of maximum closing rate), and
ta (the time factor for controlling the abruptness of closure). The
detailed waveshape parameter values for each vowel sample are listed in
Table 5-1. The corresponding LF model-based open quotients (OQ_LF) and
speed quotients (SQ_LF) are also listed in the same table. Figure 5-2
shows the source excitation waveforms, and Figure 5-3 shows the source
spectra.

During the listening test, the judges were asked to rate the
degree of hypo-/hyperfunction (i.e., the lax/tense voice quality)
of the vowel samples. For each vowel sample, the perceptual ratings
given by each individual judge and the average ratings are listed
in Table 5-1, along with the source parameters. No significant
differences in the ratings were found across the three judges. One of
the judges also reported that, aside from the differences in lax/tense
quality, all the stimuli except sample 2-4 had comparable "vowel
quality" and naturalness. The one exception (sample 2-4)
was judged to be extremely hypofunctional and showed a degradation in
overall quality (it sounded somewhat unnatural).

Table 5-1. The waveshape parameter values and the hypo-/hyperfunction
ratings for the testing vowel samples.

                                               PERCEPTUAL RATINGS OF
         WAVESHAPE PARAMETERS                  HYPO-/HYPERFUNCTION
Sample
No.      tp     te     ta     OQ_LF   SQ_LF    J1    J2    J3    AVG
1-1      58%    64%    1%     .66     7.3      5     4     6     5.0
1-2      50%    64%    1%     .66     3.1      4     3     4     3.7
1-3      47%    64%    1%     .66     2.5      3     2     2     2.3
1-4      43%    64%    1%     .66     1.9      2     2     1     1.7
1-5      36%    64%    1%     .66     1.2      0     1     0     0.3
2-1      50%    55%    2%     .59     5.6      4     4     6     4.7
2-2      50%    65%    2%     .69     2.6      3     2     3     2.7
2-3      50%    75%    2%     .79     1.7      2     1     2     1.7
2-4      50%    85%    2%     .89     1.3      0     0     0     0.0
3-1      50%    64%    0%     .64     3.6      5     4     5     4.7
3-2      50%    64%    2%     .68     2.8      3     2     3     2.7
3-3      50%    64%    5%     .76     1.9      2     2     1     1.7
3-4      50%    64%    10%    .90     1.3      0     1     0     0.3

* Common source parameters: F0 = 116 Hz, no turbulent noise.
* Ji represents judge i, i = 1, 2, 3.

Figure 5-2. The excitation waves used to produce vowel samples for
the perceptual evaluation of hypo-/hyperfunctional voice quality.

Figure 5-3. The corresponding spectra for the excitation waves shown
in Figure 5-2.

Figure 5-4 shows the scatterplots of the average perceptual
ratings versus the open quotients (OQ_LF) and the speed quotients (SQ_LF)
of the source excitation waves. These scatterplots together with the
source spectra shown in Figure 5-3 reveal the correlations between
the speed quotient, the spectral slope, and the voice quality of
hypo-/hyperfunction. An excitation wave with a low speed quotient (i.e.,
a low glottal pulse skewness and a smooth glottal closure) produces
a steeply falling spectral slope and results in hypofunctional or lax
voice quality. On the contrary, an excitation wave with a high
speed quotient produces energy at higher frequencies and results in
hyperfunctional or tense voice quality. The vowel samples that were
judged as normal in quality (with average ratings in the range of 3.0
± 0.7) have SQ_LF values from 2.5 to 3.1. These values are consistent
with the results derived from the inverse filtered glottal waves given
in the previous chapter.
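One way to put a number on the trends visible in the scatterplots is a correlation coefficient between each quotient and the average rating. The sketch below does this for the 13 samples of Table 5-1. The choice of the Pearson coefficient is an assumption made here for illustration; it is not a statistic reported in this study, and with only 13 synthetic samples it should be read as descriptive rather than inferential.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Values from Table 5-1, samples 1-1 to 1-5, 2-1 to 2-4, and 3-1 to 3-4.
sq = [7.3, 3.1, 2.5, 1.9, 1.2, 5.6, 2.6, 1.7, 1.3, 3.6, 2.8, 1.9, 1.3]
oq = [0.66, 0.66, 0.66, 0.66, 0.66, 0.59, 0.69, 0.79, 0.89,
      0.64, 0.68, 0.76, 0.90]
avg = [5.0, 3.7, 2.3, 1.7, 0.3, 4.7, 2.7, 1.7, 0.0, 4.7, 2.7, 1.7, 0.3]

r_sq = pearson(sq, avg)  # speed quotient versus average rating
r_oq = pearson(oq, avg)  # open quotient versus average rating
print(round(r_sq, 2), round(r_oq, 2))
```

A positive coefficient for the speed quotient corresponds to higher (more hyperfunctional) ratings at higher SQ_LF; a rank-based coefficient would be a reasonable alternative given the ordinal rating scale.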

In contrast to the speed quotient, the open quotient is not as
good at predicting the voice quality of hypo-/hyperfunction. For
example, vowel samples 1-1 to 1-5 have the same OQ_LF value of 0.66
but their hypo-/hyperfunction ratings vary from 5.0 (high degree
of hyperfunction) to 0.3 (extremely high degree of hypofunction).
However, high open quotients due to smooth glottal closure were found
to be related to lax voice quality.
Figure 5-4. Scatterplots of the 13 vowel samples for the perceptual
ratings of hypo-/hyperfunction versus (a) the speed quotients (SQ_LF)
and (b) the open quotients (OQ_LF) of the source excitations.

The vowel sample 2-4, which was judged to be extremely
hypofunctional and with poor overall quality, was produced by a source
excitation wave with highly symmetrical opening and closing phases
and with a high open quotient (.89, see Table 5-1). We conjecture that the quality
degradation is due to the lack of excitation during the closing phase.
Figure 5-5-(a) shows the synchronous source excitation and speech
waveform of vowel sample 2-4. For comparison, Figure 5-5-(b) shows a
normal speech waveform (sample 2-2) and its source excitation. It
is interesting to compare sample 2-4 with sample 3-4, which was also
judged to be hypofunctional but with better overall quality. The
source excitations of these two samples have spectral trends close
to each other (see Figure 5-3). But sample 3-4 has a much sharper
excitation pulse during the closing phase (Figure 5-5-(c)), as observed
in many natural glottal excitation waves. This example reveals the
perceptual significance of the phase characteristics of the source
excitations. It also suggests that it is important for a glottal
waveform model to include one or more parameters to control the closure
abruptness so that a better approximation of natural glottal waves
can be achieved.

Turbulent  Noise  Factors 

This  section  describes  the  experiments  of  voice  synthesis  and 
perceptual  evaluation  conducted  to  study  the  factors  that  are  important 
for  natural-sounding  synthetic  breathy  voice.  Vowel  samples  were 
synthesized  by  using  the  proposed  source  model  (see  Chapter  4),  which 
consists  of  a glottal  pulse  component  and  a turbulent  noise  component. 
Several  different  temporal  and  spectral  properties  were  tested  for  the 
turbulent  noise  source. 
Figure 5-5. The excitation waves and the synthesized speech waveforms:
(a) test 2-4 (poor voice quality); (b) test 2-2 (normal voice quality);
(c) test 3-4 (hypofunctional voice quality).
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Experiment  I.  Four  types  of  turbulent  noise  were  tested  in  this 
experiment.  These  were:  (1)  continuous  noise  with  a flat  spectrum,  (2) 
noise  over  50%  duration  of  each  pitch  period  and  with  a flat  spectrum, 
(3)  continuous  high-pass  filtered  noise,  and  (4)  high-pass  filtered 
noise  over  50%  duration  of  each  pitch  period.  Three  groups  of  vowel 
samples  were  synthesized,  with  each  group  consisting  of  four  vowel 
samples  synthesized  by  four  different  types  of  noise  component  at  a 
specified  noise  level.  All  the  vowel  samples  used  the  same  glottal 
pulse,  which  has  a shape  resembling  that  derived  from  the  subject  EDR 
(a  male  patient  with  breathy  voice  quality).  Table  5-2  lists  the 
detailed  source  parameter  values  for  each  vowel  sample.  Figure  5-6 
shows  the  glottal  pulse  and  the  four  types  of  noise  component. 
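
As a rough illustration of the four noise types, the following Python sketch generates them from white Gaussian noise. It is not the synthesizer used in this study: the sampling rate, the first-difference high-pass filter, the placement of the gate at the start of each pitch period, and the name make_noise are all assumptions made for the example.

```python
import numpy as np

FS = 10000                 # assumed sampling rate in Hz (hypothetical)
F0 = 108                   # fundamental frequency listed in Table 5-2
PERIOD = round(FS / F0)    # samples per pitch period (93 at these values)

def make_noise(n_samples, period, gated, high_pass, duty=0.5, seed=0):
    """Return one of the four noise types as a NumPy array."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(n_samples)   # flat-spectrum noise
    if high_pass:
        # First difference used as a stand-in high-pass filter (assumption).
        noise = np.diff(noise, prepend=0.0)
    if gated:
        # Keep noise only during the first `duty` fraction of each period.
        phase = (np.arange(n_samples) % period) / period
        noise = noise * (phase < duty)
    return noise

n = 10 * PERIOD
noise_types = {
    1: make_noise(n, PERIOD, gated=False, high_pass=False),
    2: make_noise(n, PERIOD, gated=True, high_pass=False),
    3: make_noise(n, PERIOD, gated=False, high_pass=True),
    4: make_noise(n, PERIOD, gated=True, high_pass=True),
}
```

Each array would then be scaled to the desired noise level and added to the glottal pulse train before being passed to the synthesizer.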

During  the  listening  test,  the  judges  were  asked  to  rate  the 
naturalness  of  the  breathy  quality  for  each  vowel  sample.  The  results 
of  the  perceptual  evaluation  are  listed  in  Table  5-2,  along  with  the 
corresponding source parameters. All three judges showed a
preference for the amplitude-modulated noise source. One of the judges
described the vowel samples synthesized with the continuous noise source
as tinny sounding. Two of the three judges found no difference
between noise sources with and without high-pass filtering; only one
judge showed a preference for the high-pass filtered noise source.
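
The factor-by-factor comparison can be reproduced from the ratings in Table 5-2. The short Python sketch below computes, for each judge, the mean rating with a factor switched on minus the mean rating with it switched off. The ratings are transcribed from Table 5-2; the function name factor_effect and the tuple layout are choices made only for this example.

```python
# (sample, high_pass, amp_mod, J1, J2, J3), transcribed from Table 5-2
RATINGS = [
    ("1-1", False, False, 0, 1, 4),
    ("1-2", True,  False, 0, 1, 5),
    ("1-3", False, True,  2, 2, 4),
    ("1-4", True,  True,  2, 2, 5),
    ("2-1", False, False, 1, 1, 2),
    ("2-2", True,  False, 1, 1, 3),
    ("2-3", False, True,  3, 2, 3),
    ("2-4", True,  True,  3, 2, 4),
    ("3-1", False, False, 0, 1, 2),
    ("3-2", True,  False, 0, 1, 4),
    ("3-3", False, True,  2, 3, 3),
    ("3-4", True,  True,  3, 3, 4),
]

def factor_effect(rows, factor_col):
    """Per-judge mean rating with the factor on minus with it off."""
    effects = []
    for judge_col in (3, 4, 5):
        on = [r[judge_col] for r in rows if r[factor_col]]
        off = [r[judge_col] for r in rows if not r[factor_col]]
        effects.append(sum(on) / len(on) - sum(off) / len(off))
    return effects

amp_mod_effect = factor_effect(RATINGS, 2)     # column 2: Amp-Mod
high_pass_effect = factor_effect(RATINGS, 1)   # column 1: Hi-Pass
```

With these data the amplitude-modulation effect is positive for all three judges, whereas the high-pass effect is zero or close to zero for J1 and J2 and roughly one rating point for J3.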

The results suggest that amplitude modulation of the noise
source has a significant effect on perceptual naturalness.
On the other hand, spectral shaping of the noise source by
high-pass filtering is less perceptible. We conjecture that the relative
unimportance of high-pass filtering for the noise source is due to
the high harmonic-to-noise ratio at low frequencies, i.e., the effect

Table 5-2. The noise source parameters and the naturalness ratings
for the testing vowel samples.

          NOISE SOURCE PARAMETERS       NATURALNESS RATINGS
Sample                        Noise
No.       Hi-Pass   Amp-Mod   Ratio     J1    J2    J3    AVG

1-1       no        no        .25%      0     1     4     1.7
1-2       yes       no        .25%      0     1     5     2.0
1-3       no        yes       .25%      2     2     4     2.7
1-4       yes       yes       .25%      2     2     5     3.0

2-1       no        no        .50%      1     1     2     1.3
2-2       yes       no        .50%      1     1     3     1.7
2-3       no        yes       .50%      3     2     3     2.7
2-4       yes       yes       .50%      3     2     4     3.0

3-1       no        no        .75%      0     1     2     1.0
3-2       yes       no        .75%      0     1     4     1.7
3-3       no        yes       .75%      2     3     3     2.7
3-4       yes       yes       .75%      3     3     4     3.3

* Common source parameters:
  F0=108 Hz, tp=50%, te=70%, ta=7%, Tn=15%, Dn=50%.
* Ji represents judge i, i=1, 2, 3.

Figure 5-6. The pulse component and four types of noise component used
for the source excitations in Experiment I. Noise type (1): continuous,
flat spectrum. Noise type (2): 50% of each pitch period, flat spectrum.
Noise type (3): continuous, high-pass filtered. Noise type (4): 50% of
each pitch period, high-pass filtered.

of the noise component at low frequencies is masked by the harmonics with
relatively high intensities.

Experiment II. In this experiment, we compared the perceptual
effect of the duration (as a percentage of the pitch period) of the
noise component. Five vowel samples were synthesized, with noise
components lasting 0%, 25%, 50%, 75%, and 100%
of the pitch period. The pulse component used was the same as in
Experiment  I.  Table  5-3  lists  the  detailed  source  parameter  values. 

As  in  Experiment  I,  the  judges  were  asked  to  rate  the  naturalness 
of  the  vowel  samples.  Table  5-3  lists  the  results  of  the  perceptual 
evaluation. As can be seen, the noise sources with 50% and 75%
duty cycles received the highest average score (3.7). The voice sample
with continuous noise (100% duty cycle) was not preferred, as in
Experiment I. The voice sample without any noise component (0% duty
cycle) had the lowest score and was judged to have poor overall vowel
quality. We conjecture that this is due to the low energy at higher
frequencies.
Figure  5-7  shows  the  source  spectra  of  sample  1 and  sample  3.  As  can 
be  seen,  in  sample  3 the  small  amount  of  noise  component  (noise  ratio  = 
.5%)  becomes  dominant  above  the  2 kHz  region.  This  noise  component  not 
only  causes  the  sensation  of  breathiness  but  also  boosts  the  amplitudes 
of  the  higher  formants. 

Experiment III. The purpose of this experiment was to study the
perceptual effect of the location (within a pitch period) of the noise
component. Four vowel samples were synthesized, all with a noise
component modulated by a square wave with a 50% duty cycle but at a
different location in each pitch period. Figure 5-8 shows the
arrangements of the noise source. Table 5-4 lists the detailed source
parameter values.
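
The square-wave modulation with a variable location can be sketched as a gate function of two parameters. In the Python sketch below, Dn is the duty cycle and Tn is interpreted as the center of the noise burst within the pitch period; that interpretation, the wrap-around at the period boundary, and the name noise_gate are assumptions of this example, since the exact definition of Tn is given elsewhere (Chapter 4).

```python
import numpy as np

def noise_gate(n_samples, period, tn, dn):
    """Square-wave gate: 1.0 inside the noise burst, 0.0 outside.

    tn: burst center as a fraction of the pitch period (assumed meaning).
    dn: burst duration as a fraction of the pitch period.
    """
    phase = (np.arange(n_samples) % period) / period
    # Circular distance between each sample's phase and the burst center.
    dist = np.abs((phase - tn + 0.5) % 1.0 - 0.5)
    return (dist < dn / 2.0).astype(float)

# The four gates of Experiment III (Dn = 50%; Tn = 0%, 25%, 50%, 75%),
# using a hypothetical period of 100 samples.
gates = {tn: noise_gate(200, 100, tn, 0.5) for tn in (0.0, 0.25, 0.5, 0.75)}
```

Multiplying a continuous noise sequence by one of these gates gives an amplitude-modulated noise component at the chosen location.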

Table 5-3. The noise source parameters and the naturalness ratings
for the testing vowel samples.

          NOISE SOURCE PARAMETERS       NATURALNESS RATINGS
Sample                        Noise
No.       Tn        Dn        Ratio     J1    J2    J3    AVG

1         -         0%        .00%      1     1     1     1.0
2         75%       25%       .50%      2     3     3     2.7
3         75%       50%       .50%      4     3     4     3.7
4         75%       75%       .50%      5     2     4     3.7
5         75%       100%      .50%      2     1     3     2.0

* Common source parameters:
  F0=108 Hz, tp=50%, te=70%, ta=7%, no high-pass filtering.
* Ji represents judge i, i=1, 2, 3.

Figure 5-7. The comparison of source spectra in Experiment II: (a) the
spectral envelope of sample 1, (b) the FFT spectrum of sample 3.

Figure 5-8. The common pulse component and the four different noise
components (Dn = 50%; Tn = 0%, 25%, 50%, and 75%) used for the source
excitations in Experiment III. (Different amplitude scales were used
for the pulse and noise components.)

Table 5-4. The noise source parameters and the naturalness ratings
for the testing vowel samples.

          NOISE SOURCE PARAMETERS       NATURALNESS RATINGS
Sample                        Noise
No.       Tn        Dn        Ratio     J1    J2    J3    AVG

1         0%        50%       .25%      4     2     2     2.7
2         25%       50%       .25%      2     2     3     2.3
3         50%       50%       .25%      4     3     3     3.3
4         75%       50%       .25%      4     4     3     3.7

* Common source parameters:
  F0=108 Hz, tp=50%, te=70%, ta=7%, no high-pass filtering.
* Ji represents judge i, i=1, 2, 3.

As in Experiments I and II, the judges were asked to rate the
naturalness of the vowel samples. The results of the perceptual
evaluation are listed in Table 5-4. Though the judges showed
somewhat different preferences, each judge selected sample 4 as
the most preferred or one of the most preferred samples. As shown in
Figure 5-8, the noise component of sample 4 was located near the
point of maximum glottal closure. It is very likely that human
breathy phonation behaves similarly, i.e., prominent turbulent
noise is produced at a small glottal opening. The results of the
perceptual evaluation suggested that, though the location of noise
production is not critical for the quality of synthetic breathy voices,
perceptual naturalness is improved by simulating the characteristics of
human phonation.

Experiment  IV.  The  purpose  of  this  experiment  was  to  study  the 
correlations  between  the  degree  of  perceptual  breathiness  and  the 
noise-to-harmonic  ratio  (NHR).  Four  vowel  samples  were  synthesized. 
Each of them was produced by a source excitation with a different pulse
shape but with the same fundamental frequency and the same overall
energy ratio between the noise and the pulse components (.25%). The
noise-to-harmonic  relations  in  these  source  excitations  were  not  the 
same  due  to  their  different  harmonic  distributions  in  frequency.  Table 
5-5  lists  the  detailed  source  parameter  values.  Figure  5-9  shows  the 
source  spectra. 
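
Fixing the overall energy ratio between the noise and pulse components amounts to choosing a gain for the noise. The Python sketch below shows one way to do this and converts the ratio to decibels; the function names scale_noise and ratio_to_db are hypothetical, and the energies are taken simply as sums of squared samples, which is an assumption about how the ratio was computed.

```python
import numpy as np

def scale_noise(pulse, noise, energy_ratio):
    """Scale `noise` so that its energy is `energy_ratio` times that of `pulse`."""
    gain = np.sqrt(energy_ratio * np.sum(pulse ** 2) / np.sum(noise ** 2))
    return gain * noise

def ratio_to_db(energy_ratio):
    """Express an energy ratio in decibels."""
    return 10.0 * np.log10(energy_ratio)
```

For the .25% ratio used here, ratio_to_db(0.0025) is about -26 dB, which agrees with the overall NHR listed in Table 5-5.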

During  the  listening  tests,  the  judges  were  asked  to  rate  the 
degree  of  breathiness  of  the  vowel  samples.  Table  5-5  lists  the 
perceptual  ratings.  The  results  showed  that  the  vowel  samples  differ 
greatly  in  the  degree  of  perceptual  breathiness,  in  spite  of  the  equal 
energy  ratio  between  the  noise  and  pulse  source  components.  The  degree 

Table 5-5. The source parameters, noise-to-harmonic ratios, and
breathiness ratings for the testing vowel samples.

         WAVESHAPE PARAMETERS          N/H RELATIONS      BREATHINESS RATINGS
Sample                                 Overall  Hi-freq
No.      tp    te    ta    OQ    SQtp  NHR      NHR       J1   J2   J3   AVG

1        50%   70%   7%    .81   1.6   -26 dB    2.9 dB   5    4    5    4.7
2        42%   64%   2%    .67   1.7   -26 dB   -2.1 dB   5    4    4    4.3
3        48%   64%   2%    .67   2.5   -26 dB   -5.1 dB   3    2    2    2.3
4        50%   64%   1%    .66   3.1   -26 dB   -7.6 dB   1    1    0    0.7

* Common source parameters:
  F0=108 Hz, Tn=75%, Dn=50%, no high-pass filtering.
* Ji represents judge i, i=1, 2, 3.

Figure 5-9. The source spectra of Experiment IV.

of perceptual breathiness, however, was found to be strongly related to
the noise-to-harmonic ratio (NHR) at higher frequencies. The vowel
samples with high NHRs at higher frequencies (samples 1 and 2) were
rated as very breathy (average breathiness ratings of 4.7 and 4.3).
On the other hand, the vowel sample with a low NHR at higher
frequencies (sample 4) was perceived as being very low in breathiness
(average breathiness rating of 0.7), presumably because the effect of
the turbulent noise is masked by the harmonics.
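
A band-limited NHR of this kind can be estimated from the separate noise and pulse (harmonic) components of a synthetic source. The Python sketch below sums FFT power above a lower frequency bound; the 2 kHz bound follows the text, while the function names, the use of a plain unwindowed FFT, and the sampling rate in the usage lines are assumptions of the example and not necessarily the measurement used in this study.

```python
import numpy as np

def band_energy(x, fs, f_lo=0.0):
    """Sum of squared FFT magnitudes at frequencies >= f_lo (in Hz)."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return np.sum(np.abs(spectrum[freqs >= f_lo]) ** 2)

def nhr_db(noise, harmonic, fs, f_lo=0.0):
    """Noise-to-harmonic ratio in dB over the band above f_lo."""
    return 10.0 * np.log10(band_energy(noise, fs, f_lo) /
                           band_energy(harmonic, fs, f_lo))

# Usage with stand-in signals: a strong low harmonic, a weak high harmonic,
# and a "noise" tone at 2.5 kHz (all hypothetical values).
fs = 10000
t = np.arange(fs) / fs
harmonic = np.sin(2 * np.pi * 100 * t) + 0.01 * np.sin(2 * np.pi * 3000 * t)
noise = 0.1 * np.sin(2 * np.pi * 2500 * t)
overall = nhr_db(noise, harmonic, fs)            # dominated by the 100 Hz term
high_band = nhr_db(noise, harmonic, fs, 2000.0)  # only the band above 2 kHz
```

In this stand-in case the overall NHR is low while the NHR above 2 kHz is high, which is the situation the text associates with strong perceived breathiness.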

Summary.  Our  conclusions  regarding  the  turbulent  noise  source 
include: 

(1) The amplitude modulation of the turbulent noise is important
for achieving naturalness for synthetic breathy voices. Our
data suggest that a duty cycle in the range of 50% to 75% is
preferred.

(2)  Though  the  location  (within  a pitch  period)  of  noise 
production  is  not  very  critical,  the  perceptual  naturalness  is 
improved  when  a noise  source  simulates  the  natural  manner  of 
the  human  phonation,  i.e.,  the  noise  source  should  be  located 
near  the  point  of  maximum  glottal  closure. 

(3)  The  high-pass  filtering  for  the  turbulent  noise  is  not 
critical  for  perceptual  breathiness  because  the  effect  of 
noise  in  the  low-frequency  region  is  masked  by  strong  harmonic 
components. 

(4) The degree of perceptual breathiness is primarily dependent
on the noise-to-harmonic ratio at higher frequencies (above
2 kHz).


CHAPTER  6 


CONCLUSIONS  AND  DISCUSSIONS 


Summary 

The  research  has  investigated  aspects  of  vocal  excitation 
characteristics  on  the  production  of  voice  quality.  Speech  analysis 
and  synthesis  techniques  were  developed  for  producing  natural-sounding 
synthetic  speech  with  desired  voice  characteristics.  The  research 
project  included  source-feature  extraction  from  speech  and  EGG  signals, 
source  modeling  for  speech  synthesis,  and  perceptual  evaluation  for 
selected  source  factors.  Several  voice  types  were  investigated:  modal, 
vocal fry, falsetto and breathy voices. The achievements were as
follows.

Glottal inverse filtering and glottal wave characteristics. A new
glottal  inverse  filtering  technique  (the  two-pass  method)  was  proposed. 
This  method  selects  an  analysis  interval  that  excludes  the  region  of 
the  main  excitation  pulse  to  increase  the  accuracy  of  the  estimated 
vocal  tract  transfer  function,  and  thus  the  inverse  filtered  glottal 
wave.  Using  this  technique,  the  glottal  waves  of  various  voice  types 
were  estimated  and  their  common  and  distinctive  characteristics  were 
investigated.  The  results  showed  that:  1)  for  all  the  voice  types 
under  investigation,  the  glottal  waves  exhibit  a steeper  change  of 
slopes during the closing phase than during the opening phase; 2) in
the time domain, the glottal waves of different voice types are
characterized by the pulse width, the pulse skewness, and the abruptness
of glottal closure; and 3) in the frequency domain, the major
distinguishing features include the spectral slope and the relations
between the fundamental frequency and the higher harmonics.

Source-feature  extraction  from  speech  signals.  Source-related 
features  that  characterize  various  voice  qualities  were  extracted 
directly from speech signals (instead of from the inverse filtered
glottal  waves).  Parameters  were  defined  to  measure  the  general  source 
spectral  trend,  the  energy  ratio  of  the  glottal  turbulent  noise  to  the 
harmonic  component,  and  the  temporal  energy  distribution  of  the  vocal 
excitation  during  one  pitch  period. 

EGG  waveform  features.  EGG  waveforms  of  various  voice  types 
were  investigated.  The  results  showed  that  for  all  the  voice  types 
under  investigation  the  EGG  waveforms  exhibit  a steeper  change  of 
slopes  (implying  a rapid  change  of  vocal  fold  contact  areas)  during  the 
closing  phase.  This  result  was  consistent  with  that  derived  from  the 
inverse  filtered  glottal  waves.  Sharpness  factors  (SF)  defined  on  the 
EGG  waveforms  were  successfully  used  to  characterize  vocal  fold  contact 
phenomena.  Falsetto  and  breathy  phonations  were  found  to  be  related  to 
high  SF  values,  while  modal  and  vocal  fry  phonations  were  found  to  be 
related  to  low  SF  values. 

Source  model  for  speech  synthesis.  Based  on  the  analysis  results, 
four  major  factors  were  considered  to  be  important  in  characterizing 
the  glottal  excitations  for  different  voice  types:  namely,  the  glottal 
pulse  width,  the  glottal  pulse  skewness,  the  abruptness  of  glottal 
closure,  and  the  turbulent  noise  component.  The  significance  of  these 
factors  for  the  production  of  voice  quality  was  discussed.  Existing 
glottal  waveform  models  were  evaluated  based  on  their  capability  to 
control these factors. Then, an improved source model was proposed
that consists of factors that are important for auditory perception and
is capable of producing synthetic speech with a wide range of voice
characteristics.

Perceptual  evaluation  of  glottal  factors.  Using  the  new  source 
model  with  a cascade  formant  synthesizer,  synthetic  voice  samples  were 
produced by systematically varying source parameters that correspond
to  selected  glottal  factors.  Listening  tests  were  conducted  to 
evaluate  the  auditory  effects.  The  results  showed  that  the  sensation 
of  vocal  effort  is  closely  related  to  the  speed  quotient  (SQ,  in 
the  time  domain)  and  the  spectral  slope  (in  the  frequency  domain) 
of  a glottal  wave.  A high  SQ  produces  excessive  energy  at  higher 
frequencies  and  results  in  tense  or  hyperfunctional  voice  quality.  On 
the  other  hand,  a low  SQ  corresponds  to  a steep-falling  spectral  slope 
and  results  in  lax  or  hypofunctional  voice  quality.  The  results  of 
listening  tests  also  revealed  that  the  degree  of  perceptual  breathiness 
has  a strong  positive  correlation  with  the  noise-to-harmonic  ratio 
over the frequency interval above 2 kHz. In addition, proper temporal
characteristics of a turbulent noise source are important for producing
natural-sounding breathy quality.

Directions  of  Future  Research 

The  understanding  of  the  human  vocal  excitation  is  important  for 
developing  natural-sounding  speech  synthesizers  and  reliable  speech 
recognition  systems.  The  results  derived  in  this  study  have  been  very 
encouraging,  but  there  is  still  a long  way  to  go.  Further  research  is 
suggested  in  several  areas. 

(1) The currently available methods for estimating glottal waves
are restrictive in one way or another (see Chapter 3 and
[Hillman and Weinberg, 1981]). It is worthwhile to develop
a new  and  reliable  method  (a  signal  processing  technique  or 
a hardware  transducer)  that  can  be  used  with  a wide  range 
of  speakers  and  utterances.  The  ability  to  measure  glottal 
excitation  waves  is  of  potential  benefit  for  many  problems  in 
speech  science. 

(2)  In  connected  speech,  various  prosodic  patterns  are  used  to 
express  different  types  of  statements.  Thus,  an  important 
continuation  of  the  research  reported  here  is  to  study  voice 
source  dynamics  for  connected  speech.  For  example,  various 
intonation  and  stress  patterns  may  be  correlated  to  source 
parameters  other  than  fundamental  frequency  and  timing.  The 
ultimate  practical  goal  is  to  develop  a source  model  and 
synthesis  rules  that  can  produce  natural-sounding  synthetic 
speech  with  desired  voice  and  tonal  characteristics. 

(3) Due to differences in physiological structure, and perhaps
social customs, males, females, and people of different ages use
distinctive phonation patterns that result in various voice
characteristics. Gender and age factors should be included in
future research on vocal excitation.

(4)  We  have  shown  that  the  knowledge  of  vocal  excitation  is  useful 
in  source  modeling  for  speech  synthesis.  It  is  believed  that 
the  knowledge  would  also  benefit  the  applications  of  speech 
recognition  and  speaker  identification.  For  example,  research 
can be conducted to study how to extract and use the source
parameters for speaker adaptation to improve
the reliability of a speaker-independent speech recognition
system.


APPENDIX  A 


MICROPHONE  CHARACTERISTICS  AND  INVERSE  FILTERING 

In  this  experiment,  we  compared  the  inverse  filtering  results 
derived  from  speech  signals  collected  by  microphones  with  different 
low-frequency responses. Two microphones were used: a B&K-4133
condenser microphone, which has an amplitude response within ±2 dB down
to 10 Hz and an essentially linear phase response, and an Electro-Voice
RE-10 dynamic cardioid microphone, which cuts off the low-frequency
component below 50 Hz. Speech signals were simultaneously collected
with these two microphones held at the same distance (6 inches)
from  the  speaker's  lips,  and  were  synchronously  digitized  using  two 
separate  channels  of  a DSC-200  A/D  system  (see  Chapter  2). 

Figure A-1 shows the simultaneously and synchronously collected
speech waveforms of a sustained vowel /a/. Figure A-2 shows the
corresponding spectra, which reveal that the two microphones have
different frequency responses in the low-frequency range. Figure
A-3  shows  the  corresponding  inverse  filtered  differential  glottal  waves 
and  the  integrated  glottal  waves.  The  results  showed  that  the  speech 
signals  collected  by  the  B&K-4133  microphone  can  be  used  to  recover 
the  glottal  waves  quite  well.  On  the  other  hand,  severe  distortions, 
especially  at  the  closed  glottal  phases,  appeared  in  the  glottal  waves 
estimated  from  the  speech  signals  collected  by  the  RE-10  microphone. 

Figure A-1. Speech waveforms synchronously collected by two different
microphones: (a) using a B&K-4133 microphone, and (b) using an RE-10
microphone.

Figure A-2. The corresponding speech spectra of Figure A-1, panels (a)
and (b).

Figure A-3. The inverse filtered differential glottal waves and the
corresponding integrations. (Figure A-3 (a) and (b) correspond to
Figure A-1 (a) and (b), respectively.)


APPENDIX  B 


MATCHING  A DIFFERENTIAL  GLOTTAL  WAVE  USING  THE  LF  MODEL 
AND  A LEAST  SQUARED  ERROR  CRITERION 

To  derive  an  LF  model  [Fant  et  al.,  1985]  matching  waveform  for  a 
differential  glottal  wave,  three  waveshape  parameters,  tp,  te,  and  ta, 
and  one  amplitude  parameter  Ae,  need  to  be  determined  (see  Figure  B-l 
and Chapter 4). Among them, the determination of the instant te and
its amplitude Ae is straightforward. To estimate the other two
parameters,  tp  and  ta,  we  employed  a least  squared  error  criterion. 
Since  these  two  parameters  are  independent  of  each  other  (see  equations 
4-4  and  4-5),  they  may  be  reasonably  estimated  in  separate  error 
minimization  procedures. 

Using  equation  4-4,  the  first  branch  (from  0 to  te)  of  the  LF 
model  matching  waveform  was  constructed  based  on  each  candidate  value 
of  tp  (provided  by  the  user).  Then,  the  error  minimization  procedure 
computed  the  squared  error  between  each  LF  model  matching  waveform  and 
the  corresponding  branch  of  the  differential  glottal  wave.  The  final 
tp was the one that minimized the squared error. Likewise, using
equation  4-5,  the  second  branch  (from  te  to  tc)  of  the  LF  model 
matching  waveform  was  constructed  for  each  candidate  value  of  ta,  and 
the  final  ta  was  determined  by  minimizing  the  squared  error  between 
the  LF  model  matching  waveform  and  the  corresponding  branch  of  the 
differential glottal wave. Figure B-2 shows the block diagram.
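
The search over candidate values can be sketched for the second branch as follows. This Python example assumes that equation 4-5 has the standard LF return-phase form of Fant et al. (1985), in which the decay constant (called eps here) satisfies eps * ta = 1 - exp(-eps * (tc - te)). The function names, the fixed-point solution for eps, and the use of times expressed as fractions of the pitch period are choices made for the illustration, and each candidate ta is assumed to be smaller than tc - te.

```python
import numpy as np

def lf_return_phase(t, te, tc, ta, Ae):
    """Second LF branch for te <= t <= tc (standard form assumed)."""
    span = tc - te
    eps = 1.0 / ta
    # Solve eps * ta = 1 - exp(-eps * span) by fixed-point iteration.
    for _ in range(200):
        eps = (1.0 - np.exp(-eps * span)) / ta
    return -(Ae / (eps * ta)) * (np.exp(-eps * (t - te)) - np.exp(-eps * span))

def fit_ta(t, target, te, tc, Ae, candidates):
    """Return the candidate ta that minimizes the squared error."""
    errors = [np.sum((lf_return_phase(t, te, tc, ta, Ae) - target) ** 2)
              for ta in candidates]
    return candidates[int(np.argmin(errors))]

# Usage: recover ta from a noise-free synthetic branch (hypothetical values).
t = np.linspace(0.70, 1.00, 61)
measured = lf_return_phase(t, 0.70, 1.00, 0.07, 1.0)
best_ta = fit_ta(t, measured, 0.70, 1.00, 1.0,
                 [0.01, 0.03, 0.05, 0.07, 0.09, 0.11])
```

The search for tp over the first branch would follow the same pattern, with the first-branch waveform of equation 4-4 in place of lf_return_phase.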

Figure B-1. An inverse filtered differential glottal wave and the
corresponding matching waveform produced by the LF model.

Figure B-2. Block diagram for constructing an LF model matching
waveform for a differential glottal wave.